Introduction

This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data
4. Exploring data
5. Tidy data

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Essential commands | Data science for psychologists
## 2018 06 24
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##

## Preparations: ----- 

library(tidyverse)

## Topic: ----- 

# ...

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- 

Basics

Some background knowledge and basic facts (required to learn R or any other programming language):

Concepts

  • Different types of objects: Code contains data values (objects being measured or manipulated) vs. functions (actions, operators, verbs).
  • Data formats: scalars, vectors, matrices (data frames or tibbles), arrays.
  • Data types: numeric, character, logical.
  • Accessing data formats (vectors, matrices) via names, indices, or subsetting.
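
The concepts above can be illustrated with a minimal base R sketch. All object names and values here (x, s, b, age, m) are made up for illustration and are not part of the course materials:

```r
# Data types: numeric, character, logical
x <- 3.14            # a numeric scalar (in R: a vector of length 1)
s <- "hello"         # a character scalar
b <- TRUE            # a logical scalar

# A named numeric vector (names and values are illustrative):
age <- c(adam = 46, eva = 48, xaxi = 21)

# Accessing elements of a vector:
age["eva"]           # by name
age[2]               # by index (R starts counting at 1)
age[age < 40]        # by subsetting with a logical condition

# A matrix with 2 rows and 3 columns (filled column by column):
m <- matrix(1:6, nrow = 2)
m[2, 3]              # the value in row 2, column 3
```

Note that functions (such as c and matrix) act on the data values, which is the distinction drawn in the first bullet point.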

Technology

  • Essentials: R, packages, code (commands vs. scripts).
  • Convenience: IDE like RStudio.
  • Infrastructure: R-project, stackoverflow, GitHub, …, Google.

Organization

  • Clean code (spacing, comments, sections, …)
  • Projects (with scripts, data, and folders)

Tibbles

Data can be found in the form of individual data points (so-called scalars, which can be of different types) or longer sequences of values (lists or vectors). However, most of the time we are dealing with datasets that contain multiple rows and columns (2-dimensional matrices or data frames, or multi-dimensional arrays).

Whenever working with rectangular data structures – data consisting of multiple cases (rows) and multiple variables (columns) – our first step in this course is to create or transform the data into a tibble. A tibble is defined by the package tibble and implements a particular type of data table (or a simpler version of a data frame, which is the most common data structure in R).
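
One practical difference between a data frame and a tibble can be seen in a small sketch. It assumes that the tibble package is installed; the objects df and tb are illustrative:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting a single column with [ , ]:
class(df[, 1])     # the data frame returns a plain vector
class(tb[, 1])     # the tibble returns a tibble

is.data.frame(tb)  # a tibble still counts as a data frame
```

The tibble behaves more predictably here: the result of [ , ] has the same type regardless of how many columns are selected.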

Creating tibbles

How we create tibbles depends on the form in which we encounter or obtain our data.

Basic commands

There are 3 basic commands for creating tibbles:

  1. as_tibble converts (or coerces) an existing data frame into a tibble.

  2. tibble converts several vectors into (the columns of) a tibble.

  3. tribble converts a table (entered row-by-row) into a tibble.

Check: The 3 commands yield the same type of output (i.e., a tibble), but require different inputs. Ask yourself which kind of input each command takes and how this input needs to be structured and formatted (e.g., with commas).

1. as_tibble

Use as_tibble when the data to be used already is in a data frame (or matrix):

## Using the data frame `sleep`: ------ 

# ?datasets::sleep # provides background information on the data set.

# Save the sleep data frame as df: 
df <- datasets::sleep

# Convert df into a tibble tb: 
tb <- as_tibble(df)

# Inspect the data frame df: 
dim(df)
#> [1] 20  3
is.data.frame(df)
#> [1] TRUE
head(df)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2
#> 3  -0.2     1  3
#> 4  -1.2     1  4
#> 5  -0.1     1  5
#> 6   3.4     1  6
str(df)
#> 'data.frame':    20 obs. of  3 variables:
#>  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#>  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

# Inspect the tibble tb:
dim(tb)
#> [1] 20  3
is_tibble(tb)  # is.tibble() is deprecated in recent versions of tibble
#> [1] TRUE
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 x 3
#>   extra  group     ID
#>   <dbl> <fctr> <fctr>
#> 1   0.7      1      1
#> 2  -1.6      1      2
#> 3  -0.2      1      3
#> 4  -1.2      1      4
#> 5  -0.1      1      5
#> 6   3.4      1      6
glimpse(tb)
#> Observations: 20
#> Variables: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1....
#> $ group <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID    <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, ...

Practice: Convert the data frames datasets::attitude and datasets::iris into tibbles and inspect their dimensions and contents. What types of variables do they contain?

2. tibble

Use tibble when the data to be used appears as a collection of columns. For instance, imagine we have the following information about a family:

Example data of some family.
id  name  age  gender  drives  married_2
1   Adam  46   male    TRUE    Eva
2   Eva   48   female  TRUE    Adam
3   Xaxi  21   female  FALSE   Zenon
4   Yota  19   female  TRUE    NA
5   Zack  17   male    FALSE   NA

One way of viewing this table is as a series of columns. Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).

The tibble command expects that each column of the table is entered as a vector:

## Create a tibble from vectors (column-by-column): 
fm <- tibble(
  id       = c(1, 2, 3, 4, 5), # OR: id = 1:5, 
  name     = c("Adam", "Eva", "Xaxi", "Yota", "Zack"), 
  age      = c(46, 48, 21, 19, 17), 
  gender   = c("male", rep("female", 3), "male"), 
  drives   = c(TRUE, TRUE, FALSE, TRUE, FALSE), 
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
  )

fm  # prints the tibble: 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

Note some details:

  • Each vector is labeled by the variable (column) name, which is not put into quotes;

  • Avoid spaces within variable (column) names (or enclose names in backticks if you really must use spaces);

  • All vectors need to have the same length;

  • Each vector is of a single type (numeric, character, or Boolean truth values);

  • Consecutive vectors are separated by commas (but there is no comma after the final vector).
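
Some of these details can be checked in a short sketch. It assumes that the tibble package is installed; the tibble pets and its contents are hypothetical. One addition to the list above: a vector of length 1 is an exception to the same-length rule, as tibble recycles it to the length of the other columns.

```r
library(tibble)

pets <- tibble(
  `pet name` = c("Rex", "Mia", "Bo"),  # a name with a space needs backticks
  legs       = c(4, 4, 2),
  alive      = TRUE                    # a length-1 value is recycled to 3 rows
)

pets$`pet name`  # backticks are also needed when accessing the column
dim(pets)        # 3 rows, 3 columns

# Vectors of different lengths (other than 1) are rejected with an error:
res <- try(tibble(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")
```

Because names with spaces require backticks every time they are used, names like pet_name are usually the more convenient choice.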

When using tibble, later vectors may use the values of earlier vectors:

# Using earlier vectors when defining later ones:
abc <- tibble(
  ltr = LETTERS[1:5],
  num = 1:5,
  l_n = paste(ltr, num, sep = "_"),  # combining ltr with num
  nsq = num^2                        # squaring num
  )

abc  # prints the tibble: 
#> # A tibble: 5 x 4
#>     ltr   num   l_n   nsq
#>   <chr> <int> <chr> <dbl>
#> 1     A     1   A_1     1
#> 2     B     2   B_2     4
#> 3     C     3   C_3     9
#> 4     D     4   D_4    16
#> 5     E     5   E_5    25

Practice: Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble.

3. tribble

Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).

For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it into a tibble:

## Create a tibble from tabular data (row-by-row): 
fm2 <- tribble(
  ~id, ~name, ~age, ~gender, ~drives, ~married_2,   
  #--|------|-----|--------|----------|----------|
  1,  "Adam", 46,  "male",    TRUE,     "Eva",    
  2,  "Eva",  48,  "female",  TRUE,     "Adam",  
  3,  "Xaxi", 21,  "female",  FALSE,    "Zenon",    
  4,  "Yota", 19,  "female",  TRUE,      NA, 
  5,  "Zack", 17,  "male",    FALSE,     NA      )

fm2  # prints the tibble: 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

Note some details:

  • The column names are preceded by ~;

  • Consecutive entries are separated by a comma (but there is no comma after the final entry);

  • The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;

  • The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
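
The last point can be verified in a small sketch. It assumes that the tibble package is installed; the tibbles tr and only_na are illustrative:

```r
library(tibble)

tr <- tribble(
  ~name,  ~spouse,
  "Adam", "Eva",
  "Yota", NA        # NA in a column whose other entry is a character
)

typeof(tr$spouse)   # the entire column (including NA) is of type character
is.na(tr$spouse)

# A column that contains only NA has no other values to infer a type from:
only_na <- tribble(~x, NA, NA)
typeof(only_na$x)   # logical (the default type of NA)
```

If a column with only missing values should have a particular type, a typed missing value such as NA_character_ or NA_real_ can be entered instead of NA.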

Check: If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:

# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE

Practice: Enter the tibble abc by using tribble.

Accessing parts of a tibble

Once we have an R object that is a tibble, we often want to access individual parts of it. We can distinguish between 3 simple cases:

1. Variables (columns)

As each column of a tibble is a vector, obtaining a column amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Get the name column of fm:
fm$name       # by label (with $)
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
fm[["name"]]  # by label (with [[ ]])
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
fm[[2]]       # by number (with [[ ]])
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"

# Get the age column of fm: 
fm$age        # by name (with $)
#> [1] 46 48 21 19 17
fm[["age"]]   # by name (with [[ ]])
#> [1] 46 48 21 19 17
fm[[3]]       # by number (with [[ ]])
#> [1] 46 48 21 19 17

# Note: The following commands return the same columns, but as (5 x 1) tibbles rather than as vectors:
fm[ , 2] # yields the name vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack
select(fm, 2) 
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack
select(fm, name)
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack

fm[ , 3] # yields the age vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
select(fm, 3)
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
select(fm, age)
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17

Practice: Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.

2. Cases (rows)

Extracting specific rows of a tibble amounts to filtering it and typically yields a smaller tibble rather than a vector (as a row may contain entries of different types). The best way of filtering specific rows of a tibble is using dplyr::filter. However, it’s also possible to specify the desired rows by subsetting (i.e., indexing with a logical vector that is TRUE for every row to keep) or by row number:

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Filter specific rows (by condition):
filter(fm, id > 2)
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
filter(fm, age < 18)
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm %>% filter(drives == TRUE) 
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>
  
# The same filters by using Boolean vectors (subsetting):
fm[fm$id > 2, ]
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>

# The same filters by providing specific row numbers:
fm[3:5, ]  # getting rows 3 to 5 of fm
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
fm[5, ]    # getting row 5 of fm
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm[c(1, 2, 4), ]  # getting rows 1, 2, and 4 of fm
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>

Practice: Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?

3. Cells

Accessing the values of individual tibble cells is relatively rare, but can be achieved by

a. explicitly providing both row number r and column number c (as [r, c]), or by
b. first extracting the column (as a vector v) and then providing the desired row number r (as v[r]).

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Getting specific cell values:
fm$name[4]  # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2]    # getting the same name by row and column numbers
#> # A tibble: 1 x 1
#>    name
#>   <chr>
#> 1  Yota

# Note: What if we don't know the row number? 
which(fm$name == "Yota") # getting the row number that contains the name "Yota"
#> [1] 4

In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.

# Checking and changing cell values:

# Check: "Who is Xaxi's spouse?" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 x 1
#>   married_2
#>       <chr>
#> 1     Zenon

# Change: "Zenon" is actually "Zeus" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"

# Check for successful change:
fm
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE      Zeus
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

By contrast, a relatively common task is to check an entire tibble for missing values, count them, or replace them by some other value:

# Checking for, counting, and changing missing values:

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE      Zeus
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# (a) Check for missing values:
is.na(fm)       # checks each cell value for being NA
#>         id  name   age gender drives married_2
#> [1,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [2,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [3,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [4,] FALSE FALSE FALSE  FALSE  FALSE      TRUE
#> [5,] FALSE FALSE FALSE  FALSE  FALSE      TRUE

# (b) Count the number of missing values: 
sum(is.na(fm))  # counts missing values (by adding up all TRUE values)
#> [1] 2

# (c) Change all missing values: 
fm[is.na(fm)] <- "A MISSING value!"

# Check for successful change: 
fm
#> # A tibble: 5 x 6
#>      id  name   age gender drives        married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>            <chr>
#> 1     1  Adam    46   male   TRUE              Eva
#> 2     2   Eva    48 female   TRUE             Adam
#> 3     3  Xaxi    21 female  FALSE             Zeus
#> 4     4  Yota    19 female   TRUE A MISSING value!
#> 5     5  Zack    17   male  FALSE A MISSING value!
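
Beyond the total count, it is often useful to know where the missing values are and what share of all cells they make up. The following is a minimal sketch on a hypothetical tibble d (it assumes that the tibble package is installed and that d contains no list columns):

```r
library(tibble)

# Hypothetical example data with 3 missing values:
d <- tibble(
  id    = 1:4,
  score = c(10, NA, 7, NA),
  label = c("a", "b", NA, "d")
)

colSums(is.na(d))     # number of NA values per column
sum(is.na(d))         # total number of NA values
mean(is.na(d)) * 100  # percentage of NA values among all cells (3 of 12)
```

The last line works because mean treats TRUE as 1 and FALSE as 0, so the mean of a logical matrix is the proportion of TRUE values.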

Practice: Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.

More advanced operations on tibbles are covered in Chapter 5: Data transformation and involve using the dplyr commands arrange, filter, and select.

More on tibbles

For more details on tibbles, see vignette("tibble") and the documentation of the tibble package.

Data transformation

Overview

When we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data. Six essential dplyr commands are:

  1. arrange sorts cases (rows);
  2. filter selects cases (rows) by logical conditions;
  3. select selects and reorders variables (columns);
  4. mutate computes new variables (columns) and adds them to existing ones;
  5. summarise collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by changes the unit of aggregation (in combination with mutate and summarise).
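
As a first impression of how the last three commands work together, here is a minimal sketch on hypothetical data. The tibble d and the variable names bmi and mean_bmi are assumptions made for illustration only:

```r
library(dplyr)

# Hypothetical example data:
d <- tibble(
  group  = c("a", "a", "b", "b"),
  height = c(150, 170, 160, 180),  # in cm
  mass   = c(50, 70, 60, 90)       # in kg
)

res <- d %>%
  mutate(bmi = mass / (height / 100)^2) %>%  # add a new column
  group_by(group) %>%                        # aggregate by group
  summarise(n = n(),                         # number of cases per group
            mean_bmi = mean(bmi))            # mean of the new column per group

res  # one row per group
```

Without group_by, summarise would collapse all four rows into a single row.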

Not quite as essential but still useful dplyr commands include:

  • slice selects (ranges of) cases (rows) by number;
  • rename renames variables (columns) and keeps others;
  • transmute computes new variables (columns) and drops existing ones;
  • sample_n and sample_frac draw random samples of cases (rows).
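
A minimal sketch of these additional commands, again on a hypothetical tibble d (all object and variable names are illustrative):

```r
library(dplyr)

d <- tibble(id = 1:5, x = c(2, 4, 6, 8, 10))  # hypothetical data

s   <- slice(d, 2:3)              # rows 2 and 3 (selected by position)
r   <- rename(d, value = x)       # new_name = old_name; id is kept as it is
tm  <- transmute(d, half = x / 2) # keeps only the newly computed column
smp <- sample_n(d, 2)             # 2 randomly drawn rows

names(r)
names(tm)
```

Since sample_n draws rows at random, its result differs between runs unless a seed is set (e.g., with set.seed).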

Commands and examples

We save the dplyr::starwars data as a tibble sw and use it to illustrate the essential dplyr commands.

library(tidyverse)
sw <- dplyr::starwars

sw  # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#>                  name height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32          <NA>  white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Practice: How many sw variables (columns) are there and of which type are they? How many missing (NA) values are there?

1. arrange to sort rows

Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):

# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#>                   name height  mass hair_color          skin_color
#>                  <chr>  <int> <dbl>      <chr>               <chr>
#>  1              Ackbar    180    83       none        brown mottle
#>  2          Adi Gallia    184    50       none                dark
#>  3    Anakin Skywalker    188    84      blond                fair
#>  4        Arvel Crynyd     NA    NA      brown                fair
#>  5         Ayla Secura    178    55       none                blue
#>  6 Bail Prestor Organa    191    NA      black                 tan
#>  7       Barriss Offee    166    50      black              yellow
#>  8                 BB8     NA    NA       none                none
#>  9      Ben Quadinaros    163    65       none grey, green, yellow
#> 10  Beru Whitesun lars    165    75      brown               light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# The same command using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  arrange(name) 
#> # A tibble: 87 x 13
#>                   name height  mass hair_color          skin_color
#>                  <chr>  <int> <dbl>      <chr>               <chr>
#>  1              Ackbar    180    83       none        brown mottle
#>  2          Adi Gallia    184    50       none                dark
#>  3    Anakin Skywalker    188    84      blond                fair
#>  4        Arvel Crynyd     NA    NA      brown                fair
#>  5         Ayla Secura    178    55       none                blue
#>  6 Bail Prestor Organa    191    NA      black                 tan
#>  7       Barriss Offee    166    50      black              yellow
#>  8                 BB8     NA    NA       none                none
#>  9      Ben Quadinaros    163    65       none grey, green, yellow
#> 10  Beru Whitesun lars    165    75      brown               light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# Sort rows in descending order:
sw %>% 
  arrange(desc(name)) 
#> # A tibble: 87 x 13
#>                     name height  mass   hair_color          skin_color
#>                    <chr>  <int> <dbl>        <chr>               <chr>
#>  1            Zam Wesell    168    55       blonde fair, green, yellow
#>  2                  Yoda     66    17        white               green
#>  3           Yarael Poof    264    NA         none               white
#>  4        Wilhuff Tarkin    180    NA auburn, grey                fair
#>  5 Wicket Systri Warrick     88    20        brown               brown
#>  6        Wedge Antilles    170    77        brown                fair
#>  7                 Watto    137    NA        black          blue, grey
#>  8            Wat Tambor    193    48         none         green, grey
#>  9            Tion Medon    206    80         none                grey
#> 10               Taun We    213    NA         none                grey
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# Sort by multiple variables:
sw %>% 
  arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#>          name height  mass hair_color       skin_color eye_color
#>         <chr>  <int> <dbl>      <chr>            <chr>     <chr>
#>  1    Taun We    213    NA       none             grey     black
#>  2   Shaak Ti    178    57       none red, blue, white     black
#>  3    Lama Su    229    88       none             grey     black
#>  4 Tion Medon    206    80       none             grey     black
#>  5  Kit Fisto    196    87       none            green     black
#>  6   Plo Koon    188    80       none           orange     black
#>  7     Greedo    173    74       <NA>            green     black
#>  8  Nien Nunb    160    68       none             grey     black
#>  9    Gasgano    122    NA       none      white, blue     black
#> 10        BB8     NA    NA       none             none     black
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## Note: See 
# ?dplyr::arrange  # for more help and examples.

Note some details:

  • All basic dplyr commands can be called as verb(data, ...) or – using the pipe from magrittr – as data %>% verb(...) (see vignette("magrittr") for details).

  • Variable names are unquoted.

  • The order of variable names (x, y, ...) specifies the order or priority of operations (first by x, then by y, etc.).
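
The first and last points can be checked on a small hypothetical tibble d (the names d, a1, and a2 are illustrative):

```r
library(dplyr)

d <- tibble(
  x = c("b", "a", "b", "a"),
  y = c(1, 2, 3, 4)
)

a1 <- arrange(d, x, desc(y))     # called as verb(data, ...)
a2 <- d %>% arrange(x, desc(y))  # called as data %>% verb(...)

identical(a1, a2)  # both calls yield the same result

a1  # sorted by x first; ties in x are ordered by descending y
```

Swapping the order of the arguments to arrange(d, desc(y), x) would sort by y first and use x only to break ties in y.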

Practice: Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA values in sorted variables placed?

2. filter to select rows

Using filter selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE and drops all rows for which the conditions are FALSE or NA.
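
The treatment of NA values is easy to overlook. Here is a minimal sketch on a hypothetical tibble d (the names d, heavy, and heavy_or_na are illustrative):

```r
library(dplyr)

d <- tibble(id = 1:4, mass = c(50, NA, 80, 120))  # hypothetical data

# NA > 75 evaluates to NA, so the row with id 2 is dropped:
heavy <- filter(d, mass > 75)
heavy$id

# To keep cases with missing values, ask for them explicitly:
heavy_or_na <- filter(d, mass > 75 | is.na(mass))
heavy_or_na$id
```

This is why the number of rows kept by a condition and by its negation may not add up to the total number of rows.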

# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#>                  name height  mass    hair_color skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>      <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond       fair      blue
#>  2        Darth Vader    202   136          none      white    yellow
#>  3        Leia Organa    150    49         brown      light     brown
#>  4          Owen Lars    178   120   brown, grey      light      blue
#>  5 Beru Whitesun lars    165    75         brown      light      blue
#>  6  Biggs Darklighter    183    84         black      light     brown
#>  7     Obi-Wan Kenobi    182    77 auburn, white       fair blue-gray
#>  8   Anakin Skywalker    188    84         blond       fair      blue
#>  9     Wilhuff Tarkin    180    NA  auburn, grey       fair      blue
#> 10           Han Solo    180    80         brown       fair     brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# The same command using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  filter(species == "Human")
#> # A tibble: 35 x 13
#>                  name height  mass    hair_color skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>      <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond       fair      blue
#>  2        Darth Vader    202   136          none      white    yellow
#>  3        Leia Organa    150    49         brown      light     brown
#>  4          Owen Lars    178   120   brown, grey      light      blue
#>  5 Beru Whitesun lars    165    75         brown      light      blue
#>  6  Biggs Darklighter    183    84         black      light     brown
#>  7     Obi-Wan Kenobi    182    77 auburn, white       fair blue-gray
#>  8   Anakin Skywalker    188    84         blond       fair      blue
#>  9     Wilhuff Tarkin    180    NA  auburn, grey       fair      blue
#> 10           Han Solo    180    80         brown       fair     brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter by multiple (additive) conditions: 
sw %>%
  filter(height > 180, mass <= 75)  # tall and light individuals
#> # A tibble: 3 x 13
#>            name height  mass hair_color  skin_color eye_color birth_year
#>           <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>
#> 1 Jar Jar Binks    196    66       none      orange    orange         52
#> 2    Adi Gallia    184    50       none        dark      blue         NA
#> 3    Wat Tambor    193    48       none green, grey   unknown         NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# The same command using the logical operator (&): 
sw %>%
  filter(height > 180 & mass <= 75)  # tall and light individuals
#> # A tibble: 3 x 13
#>            name height  mass hair_color  skin_color eye_color birth_year
#>           <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>
#> 1 Jar Jar Binks    196    66       none      orange    orange         52
#> 2    Adi Gallia    184    50       none        dark      blue         NA
#> 3    Wat Tambor    193    48       none green, grey   unknown         NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

# Filter for a range of a specific variable:
sw %>%
  filter(height >= 150, height <= 165)  # (a) using height twice
#> # A tibble: 9 x 13
#>                 name height  mass hair_color          skin_color eye_color
#>                <chr>  <int> <dbl>      <chr>               <chr>     <chr>
#> 1        Leia Organa    150    49      brown               light     brown
#> 2 Beru Whitesun lars    165    75      brown               light      blue
#> 3         Mon Mothma    150    NA     auburn                fair      blue
#> 4          Nien Nunb    160    68       none                grey     black
#> 5     Shmi Skywalker    163    NA      black                fair     brown
#> 6     Ben Quadinaros    163    65       none grey, green, yellow    orange
#> 7              Cordé    157    NA      brown               light     brown
#> 8              Dormé    165    NA      brown               light     brown
#> 9      Padmé Amidala    165    45      brown               light     brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

sw %>%
  filter(between(height, 150, 165))     # (b) using between(...)
#> # A tibble: 9 x 13
#>                 name height  mass hair_color          skin_color eye_color
#>                <chr>  <int> <dbl>      <chr>               <chr>     <chr>
#> 1        Leia Organa    150    49      brown               light     brown
#> 2 Beru Whitesun lars    165    75      brown               light      blue
#> 3         Mon Mothma    150    NA     auburn                fair      blue
#> 4          Nien Nunb    160    68       none                grey     black
#> 5     Shmi Skywalker    163    NA      black                fair     brown
#> 6     Ben Quadinaros    163    65       none grey, green, yellow    orange
#> 7              Cordé    157    NA      brown               light     brown
#> 8              Dormé    165    NA      brown               light     brown
#> 9      Padmé Amidala    165    45      brown               light     brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

# Filter by multiple (alternative) conditions: 
sw %>%
  filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#>                name height  mass hair_color skin_color eye_color
#>               <chr>  <int> <dbl>      <chr>      <chr>     <chr>
#> 1         Chewbacca    228   112      brown    unknown      blue
#> 2            Greedo    173    74       <NA>      green     black
#> 3              Yoda     66    17      white      green     brown
#> 4             Bossk    190   113       none      green       red
#> 5        Rugor Nass    206    NA       none      green    orange
#> 6         Kit Fisto    196    87       none      green     black
#> 7 Poggle the Lesser    183    80       none      green    yellow
#> 8           Tarfful    234   136      brown      brown      blue
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

# Filter cases with missing (NA) values on specific variables:
sw %>%
  filter(is.na(gender))
#> # A tibble: 3 x 13
#>    name height  mass hair_color  skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>  <chr>
#> 1 C-3PO    167    75       <NA>        gold    yellow        112   <NA>
#> 2 R2-D2     96    32       <NA> white, blue       red         33   <NA>
#> 3 R5-D4     97    32       <NA>  white, red       red         NA   <NA>
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter cases with existing (non-NA) values on specific variables:
sw %>%
  filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#>                  name height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8  Biggs Darklighter    183    84         black       light     brown
#>  9     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> 10   Anakin Skywalker    188    84         blond        fair      blue
#> # ... with 26 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## Note: See 
# ?dplyr::filter  # for more help and examples.

Note some details:

  • Separating multiple conditions or tests (x, y, ...) by commas means the same as combining them with the logical AND (&), as each test results in a vector of Boolean values.

  • Variable names are unquoted.

  • Unlike base R subsetting with [ ], filter drops rows for which the condition evaluates to NA.

  • Additional filter functions include near() for testing numerical (near-)identity.
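
The details above can be checked on a small example. The following is a minimal sketch that assumes a made-up tibble named toy (it is not part of sw) and converts it to a base R data frame for the comparison with [ ]:

```r
library(dplyr)

# Hypothetical toy data (not part of sw):
toy <- tibble(id = 1:4,
              x  = c(1, NA, 3, 4),
              y  = c("a", "b", "a", NA))

# Separating conditions by a comma or by & selects the same rows:
r_comma <- filter(toy, x > 1, y == "a")
r_and   <- filter(toy, x > 1 & y == "a")
identical(r_comma, r_and)         # TRUE

# filter() drops rows for which the condition is NA,
# whereas base R subsetting with [ ] returns them as NA rows:
df <- as.data.frame(toy)
n_filter <- nrow(filter(toy, x > 1))  # 2 (ids 3 and 4)
n_base   <- nrow(df[df$x > 1, ])      # 3 (includes 1 all-NA row)

# near() compares numbers with a tolerance:
eq_exact <- sqrt(2)^2 == 2        # FALSE (floating point error)
eq_near  <- near(sqrt(2)^2, 2)    # TRUE
```

When rows with NA values should be kept, they have to be requested explicitly, e.g., by filter(toy, x > 1 | is.na(x)).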

Practice: Use filter on sw to select very diverse or narrow subsets of individuals. For instance,

  • which individual with blond hair and blue eyes has an unknown mass?
  • to which species do individuals belong that are over 2 m tall and have brown hair?
  • which individuals from Tatooine are not male (but may be NA)?
  • which individuals are neither male nor female OR heavier than 130 kg?

3. select to select columns

select picks variables (columns) by their names or column numbers:

# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when providing a vector of variable names: 
sw %>%
  select(c(name, species, birth_year, gender)) 
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when providing column numbers:
sw %>%
  select(1, 10, 7, 8) 
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# The same when providing a vector of column numbers: 
sw %>%
  select(c(1, 10, 7, 8)) 
#> # A tibble: 87 x 4
#>                  name species birth_year gender
#>                 <chr>   <chr>      <dbl>  <chr>
#>  1     Luke Skywalker   Human       19.0   male
#>  2              C-3PO   Droid      112.0   <NA>
#>  3              R2-D2   Droid       33.0   <NA>
#>  4        Darth Vader   Human       41.9   male
#>  5        Leia Organa   Human       19.0 female
#>  6          Owen Lars   Human       52.0   male
#>  7 Beru Whitesun lars   Human       47.0 female
#>  8              R5-D4   Droid         NA   <NA>
#>  9  Biggs Darklighter   Human       24.0   male
#> 10     Obi-Wan Kenobi   Human       57.0   male
#> # ... with 77 more rows

# Select ranges of variables with ":":
sw %>%
  select(name:mass, films:starships)
#> # A tibble: 87 x 6
#>                  name height  mass     films  vehicles starships
#>                 <chr>  <int> <dbl>    <list>    <list>    <list>
#>  1     Luke Skywalker    172    77 <chr [5]> <chr [2]> <chr [2]>
#>  2              C-3PO    167    75 <chr [6]> <chr [0]> <chr [0]>
#>  3              R2-D2     96    32 <chr [7]> <chr [0]> <chr [0]>
#>  4        Darth Vader    202   136 <chr [4]> <chr [0]> <chr [1]>
#>  5        Leia Organa    150    49 <chr [5]> <chr [1]> <chr [0]>
#>  6          Owen Lars    178   120 <chr [3]> <chr [0]> <chr [0]>
#>  7 Beru Whitesun lars    165    75 <chr [3]> <chr [0]> <chr [0]>
#>  8              R5-D4     97    32 <chr [1]> <chr [0]> <chr [0]>
#>  9  Biggs Darklighter    183    84 <chr [1]> <chr [0]> <chr [1]>
#> 10     Obi-Wan Kenobi    182    77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows

# Select to re-order variables (columns) with everything():
sw %>%
  select(species, name, gender, everything())
#> # A tibble: 87 x 13
#>    species               name gender height  mass    hair_color
#>      <chr>              <chr>  <chr>  <int> <dbl>         <chr>
#>  1   Human     Luke Skywalker   male    172    77         blond
#>  2   Droid              C-3PO   <NA>    167    75          <NA>
#>  3   Droid              R2-D2   <NA>     96    32          <NA>
#>  4   Human        Darth Vader   male    202   136          none
#>  5   Human        Leia Organa female    150    49         brown
#>  6   Human          Owen Lars   male    178   120   brown, grey
#>  7   Human Beru Whitesun lars female    165    75         brown
#>  8   Droid              R5-D4   <NA>     97    32          <NA>
#>  9   Human  Biggs Darklighter   male    183    84         black
#> 10   Human     Obi-Wan Kenobi   male    182    77 auburn, white
#> # ... with 77 more rows, and 7 more variables: skin_color <chr>,
#> #   eye_color <chr>, birth_year <dbl>, homeworld <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Select variables with helper functions:
sw %>%
  select(starts_with("s"))
#> # A tibble: 87 x 3
#>     skin_color species starships
#>          <chr>   <chr>    <list>
#>  1        fair   Human <chr [2]>
#>  2        gold   Droid <chr [0]>
#>  3 white, blue   Droid <chr [0]>
#>  4       white   Human <chr [1]>
#>  5       light   Human <chr [0]>
#>  6       light   Human <chr [0]>
#>  7       light   Human <chr [0]>
#>  8  white, red   Droid <chr [0]>
#>  9       light   Human <chr [1]>
#> 10        fair   Human <chr [5]>
#> # ... with 77 more rows

sw %>%
  select(ends_with("s"))
#> # A tibble: 87 x 5
#>     mass species     films  vehicles starships
#>    <dbl>   <chr>    <list>    <list>    <list>
#>  1    77   Human <chr [5]> <chr [2]> <chr [2]>
#>  2    75   Droid <chr [6]> <chr [0]> <chr [0]>
#>  3    32   Droid <chr [7]> <chr [0]> <chr [0]>
#>  4   136   Human <chr [4]> <chr [0]> <chr [1]>
#>  5    49   Human <chr [5]> <chr [1]> <chr [0]>
#>  6   120   Human <chr [3]> <chr [0]> <chr [0]>
#>  7    75   Human <chr [3]> <chr [0]> <chr [0]>
#>  8    32   Droid <chr [1]> <chr [0]> <chr [0]>
#>  9    84   Human <chr [1]> <chr [0]> <chr [1]>
#> 10    77   Human <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows

sw %>%
  select(contains("_"))
#> # A tibble: 87 x 4
#>       hair_color  skin_color eye_color birth_year
#>            <chr>       <chr>     <chr>      <dbl>
#>  1         blond        fair      blue       19.0
#>  2          <NA>        gold    yellow      112.0
#>  3          <NA> white, blue       red       33.0
#>  4          none       white    yellow       41.9
#>  5         brown       light     brown       19.0
#>  6   brown, grey       light      blue       52.0
#>  7         brown       light      blue       47.0
#>  8          <NA>  white, red       red         NA
#>  9         black       light     brown       24.0
#> 10 auburn, white        fair blue-gray       57.0
#> # ... with 77 more rows

sw %>%
  select(matches("or"))
#> # A tibble: 87 x 4
#>       hair_color  skin_color eye_color homeworld
#>            <chr>       <chr>     <chr>     <chr>
#>  1         blond        fair      blue  Tatooine
#>  2          <NA>        gold    yellow  Tatooine
#>  3          <NA> white, blue       red     Naboo
#>  4          none       white    yellow  Tatooine
#>  5         brown       light     brown  Alderaan
#>  6   brown, grey       light      blue  Tatooine
#>  7         brown       light      blue  Tatooine
#>  8          <NA>  white, red       red  Tatooine
#>  9         black       light     brown  Tatooine
#> 10 auburn, white        fair blue-gray   Stewjon
#> # ... with 77 more rows

# Renaming variables:
sw %>%
  rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#>              creature height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32          <NA>  white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, from_planet <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## Note: See 
# ?dplyr::select     # for more help and examples. 
# ?dplyr::select_if  # for more help and examples. 

Note some details:

  • select works both by specifying variable (column) names and by specifying column numbers.

  • Variable names are unquoted.

  • The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.

  • Selecting and adding everything() allows re-ordering.

  • Various helper functions (e.g., starts_with, ends_with, contains, matches, num_range) refer to (parts of) variable names.

  • rename renames specified variables (without quotes) and keeps all other variables.
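
The helper num_range is mentioned above, but not shown in the examples. The following is a minimal sketch that assumes a made-up tibble named toy with numbered variables (it is not part of sw):

```r
library(dplyr)

# Hypothetical toy data (not part of sw):
toy <- tibble(id  = 1:3,
              x1  = c(2, 4, 6),
              x2  = c(1, 3, 5),
              x3  = c(0, 0, 1),
              grp = c("a", "b", "a"))

# num_range() matches a prefix followed by a range of numbers:
nm_range <- toy %>% select(num_range("x", 1:2)) %>% names()
nm_range   # "x1" "x2"

# The order of names in select() sets the order of columns:
nm_order <- toy %>% select(grp, id) %>% names()
nm_order   # "grp" "id"

# everything() adds all remaining columns (in their original order):
nm_all <- toy %>% select(grp, everything()) %>% names()
nm_all     # "grp" "id" "x1" "x2" "x3"

# rename() keeps all columns and only changes the specified names:
nm_new <- toy %>% rename(group = grp) %>% names()
nm_new     # "id" "x1" "x2" "x3" "group"
```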

Practice: Use select on sw to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).

4. mutate to compute new variables

Using mutate computes new variables (columns) from scratch or existing ones:

# Preparation: Save only a subset of the variables of sw as sws:   
sws <- select(sw, name:mass, birth_year:species) 
sws    # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows

# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>

# The same using the pipe:
sws %>%
  mutate(id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>

# transmute computes and keeps only the new variables:
sws %>%
  transmute(id = 1:nrow(sws), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#>       id height_feet
#>    <int>       <dbl>
#>  1     1    5.643045
#>  2     2    5.479003
#>  3     3    3.149606
#>  4     4    6.627297
#>  5     5    4.921260
#>  6     6    5.839895
#>  7     7    5.413386
#>  8     8    3.182415
#>  9     9    6.003937
#> 10    10    5.971129
#> # ... with 77 more rows

# Compute variables based on multiple others (including computed ones):
sws %>%
  mutate(BMI = mass / ((height / 100)  ^ 2),  # compute body mass index (kg/m^2)
         BMI_low  = BMI < 18.5,               # classify low BMI values
         BMI_high = BMI > 30,                 # classify high BMI values
         BMI_norm = !BMI_low & !BMI_high      # classify normal BMI values 
         )
#> # A tibble: 87 x 11
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 4 more variables: BMI <dbl>, BMI_low <lgl>,
#> #   BMI_high <lgl>, BMI_norm <lgl>

## Note: See 
# ?dplyr::mutate  # for more help and examples. 

Note some details:

  • mutate computes new variables (columns) and adds them to existing ones, while transmute drops existing ones.

  • Each mutate command specifies a new variable name (without quotes), followed by = and a rule for computing the new variable from existing ones.

  • Variable names are unquoted.

  • Multiple mutate steps are separated by commas; each step creates a new variable and can use variables created in earlier steps.

  • See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.
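
The difference between mutate and transmute, and the use of some helper functions for creating variables, can be sketched on a small example. The tibble toy and its variables day and steps are made up for illustration (they are not part of sw); cumsum and lag are only 2 of many possible helper functions:

```r
library(dplyr)

# Hypothetical toy data (not part of sw):
toy <- tibble(day   = 1:4,
              steps = c(100, 250, 150, 300))

# mutate() adds new variables to the existing ones:
toy_m <- toy %>%
  mutate(total  = cumsum(steps),         # running total
         change = steps - lag(steps),    # difference to previous row (NA in row 1)
         above  = steps > mean(steps))   # comparison with the overall mean (200)

# transmute() only keeps the variables that are named or computed:
toy_t <- toy %>%
  transmute(day, total = cumsum(steps))

ncol(toy_m)  # 5 (2 existing + 3 new variables)
ncol(toy_t)  # 2
```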

Practice: Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)

5. summarise to compute summaries

summarise applies a function to a specified variable and collapses its values (i.e., the rows of the specified column) into a single value. It works with many different summary functions (e.g., mean, median, sd, or n) by itself, but is even more useful in combination with group_by (discussed next).

# Summarise allows computing a function for a variable (column): 
summarise(sw, mn_mass = mean(mass, na.rm = TRUE))  # => 97.31 kg 
#> # A tibble: 1 x 1
#>    mn_mass
#>      <dbl>
#> 1 97.31186

# The same using the pipe: 
sw %>%
  summarise(mn_mass = mean(mass, na.rm = TRUE))  # => 97.31 kg 
#> # A tibble: 1 x 1
#>    mn_mass
#>      <dbl>
#> 1 97.31186

# Multiple summarise steps allow applying 
# different functions for 1 dependent variable: 
sw %>%
  summarise(n_mass = sum(!is.na(mass)), 
            mn_mass = mean(mass, na.rm = TRUE),
            md_mass = median(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE),
            max_mass = max(mass, na.rm = TRUE),
            big_mass = any(mass > 1000)
            )
#> # A tibble: 1 x 6
#>   n_mass  mn_mass md_mass  sd_mass max_mass big_mass
#>    <int>    <dbl>   <dbl>    <dbl>    <dbl>    <lgl>
#> 1     59 97.31186      79 169.4572     1358     TRUE
            
# Multiple summarise steps also allow applying 
# different functions to different dependent variables: 
sw %>%
  summarise(# Descriptives of height:  
            n_height = sum(!is.na(height)), 
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE), 
            # Descriptives of mass:
            n_mass = sum(!is.na(mass)), 
            mn_mass = mean(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE),
            # Counts of character variables:
            n_names = n(), 
            n_species = n_distinct(species),
            n_worlds = n_distinct(homeworld)
            )
#> # A tibble: 1 x 9
#>   n_height mn_height sd_height n_mass  mn_mass  sd_mass n_names n_species
#>      <int>     <dbl>     <dbl>  <int>    <dbl>    <dbl>   <int>     <int>
#> 1       81   174.358  34.77043     59 97.31186 169.4572      87        38
#> # ... with 1 more variables: n_worlds <int>

## Note: See 
# ?dplyr::summarise  # for more help and examples. 

Note some details:

  • summarise collapses multiple values into one value and returns a new tibble with a single row (or one row per group, when used after group_by) and one column for each value computed.

  • Each summarise step specifies a new variable name (without quotes), followed by =, and a function for computing the new variable from existing ones.

  • Multiple summarise steps are separated by commas.

  • Variable names are unquoted.

  • See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with summarise.
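
The shape of the result can be verified on a small example. The following is a minimal sketch that assumes a made-up tibble named toy (it is not part of sw):

```r
library(dplyr)

# Hypothetical toy data (not part of sw):
toy <- tibble(grp = c("a", "a", "b"),
              val = c(1, 3, NA))

# 3 summarise steps yield 1 row and 3 columns:
s <- toy %>%
  summarise(n     = n(),                     # number of rows
            n_val = sum(!is.na(val)),        # number of non-missing values
            mn    = mean(val, na.rm = TRUE)) # mean of non-missing values

dim(s)  # 1 3
```

Note that n() counts all rows (including those with missing values), so that n and n_val differ here.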

Practice: Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.

6. group_by to aggregate variables

group_by does not change the data itself, but changes the unit of aggregation for subsequent commands. This is very useful in combination with mutate and summarise.

# Grouping does not change the data, but lists its groups: 
group_by(sws, species)  # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups:   species [38]
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows

# The same using the pipe: 
sws %>%
  group_by(species)  # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups:   species [38]
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows

# group_by is ineffective by itself, but very powerful 
# (a) in combination with `mutate` and 
# (b) in combination with `summarise`. 

# ad (a):
# In combination with mutate and an aggregation function, 
# group_by changes the unit of aggregation:

sws %>%
  mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>%  # aggregates over ALL cases
  group_by(species) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>%  # aggregates over current group (species)
  group_by(gender) %>%
  mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>%  # aggregates over current group (gender)
  group_by(name) %>%
  mutate(mn_height_4 = mean(height, na.rm = TRUE))      # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups:   name [87]
#>                  name height  mass birth_year gender homeworld species
#>                 <chr>  <int> <dbl>      <dbl>  <chr>     <chr>   <chr>
#>  1     Luke Skywalker    172    77       19.0   male  Tatooine   Human
#>  2              C-3PO    167    75      112.0   <NA>  Tatooine   Droid
#>  3              R2-D2     96    32       33.0   <NA>     Naboo   Droid
#>  4        Darth Vader    202   136       41.9   male  Tatooine   Human
#>  5        Leia Organa    150    49       19.0 female  Alderaan   Human
#>  6          Owen Lars    178   120       52.0   male  Tatooine   Human
#>  7 Beru Whitesun lars    165    75       47.0 female  Tatooine   Human
#>  8              R5-D4     97    32         NA   <NA>  Tatooine   Droid
#>  9  Biggs Darklighter    183    84       24.0   male  Tatooine   Human
#> 10     Obi-Wan Kenobi    182    77       57.0   male   Stewjon   Human
#> # ... with 77 more rows, and 4 more variables: mn_height_1 <dbl>,
#> #   mn_height_2 <dbl>, mn_height_3 <dbl>, mn_height_4 <dbl>

# ad (b):
# group_by is particularly useful in combination 
# with summarise:

sws %>%
  group_by(homeworld) %>%
  summarise(count = n(),
            mn_height = mean(height, na.rm = TRUE),
            mn_mass = mean(mass, na.rm = TRUE)
            )
#> # A tibble: 49 x 4
#>         homeworld count mn_height mn_mass
#>             <chr> <int>     <dbl>   <dbl>
#>  1       Alderaan     3  176.3333    64.0
#>  2    Aleen Minor     1   79.0000    15.0
#>  3         Bespin     1  175.0000    79.0
#>  4     Bestine IV     1  180.0000   110.0
#>  5 Cato Neimoidia     1  191.0000    90.0
#>  6          Cerea     1  198.0000    82.0
#>  7       Champala     1  196.0000     NaN
#>  8      Chandrila     1  150.0000     NaN
#>  9   Concord Dawn     1  183.0000    79.0
#> 10       Corellia     2  175.0000    78.5
#> # ... with 39 more rows

# Note that this pipe returns a new tibble, 
# with 49 rows (= different levels of homeworld) and 
# - 1 column of the group variable (homeworld) and 
# - 3 columns of the 3 newly summarised variables.


# group_by used with multiple variables creates one group 
# for each combination of variable levels that occurs in the data: 
sw %>%
  group_by(hair_color, eye_color)  # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups:   hair_color, eye_color [35]
#>                  name height  mass    hair_color  skin_color eye_color
#>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75          <NA>        gold    yellow
#>  3              R2-D2     96    32          <NA> white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32          <NA>  white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Counting the frequency of cases in groups:
sw %>%
  group_by(hair_color, eye_color) %>%
  count() %>%
  arrange(desc(n))  
#> # A tibble: 35 x 3
#> # Groups:   hair_color, eye_color [35]
#>    hair_color eye_color     n
#>         <chr>     <chr> <int>
#>  1      black     brown     9
#>  2      brown     brown     9
#>  3       none     black     9
#>  4      brown      blue     7
#>  5       none    orange     7
#>  6       none    yellow     6
#>  7      blond      blue     3
#>  8       none      blue     3
#>  9       none       red     3
#> 10      black      blue     2
#> # ... with 25 more rows

# The same using summarise:
sw %>%
  group_by(hair_color, eye_color) %>%
  summarise(n = n()) %>%
  arrange(desc(n))  
#> # A tibble: 35 x 3
#> # Groups:   hair_color [13]
#>    hair_color eye_color     n
#>         <chr>     <chr> <int>
#>  1      black     brown     9
#>  2      brown     brown     9
#>  3       none     black     9
#>  4      brown      blue     7
#>  5       none    orange     7
#>  6       none    yellow     6
#>  7      blond      blue     3
#>  8       none      blue     3
#>  9       none       red     3
#> 10      black      blue     2
#> # ... with 25 more rows

## Note: See 
# ?dplyr::group_by  # for more help and examples. 

Note some details:

  • group_by changes the unit of aggregation for other commands (mutate and summarise).

  • Variable names are unquoted.

  • When using group_by with multiple variables, they are separated by commas.

  • Using group_by with mutate results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, using group_by with summarise results in a new tibble with one case (row) for each combination of variable levels that occurs in the data.
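
The contrast between grouped mutate and grouped summarise can be sketched on a small example. The tibble toy is made up for illustration (it is not part of sw); the final ungroup() is an optional step that removes the grouping again:

```r
library(dplyr)

# Hypothetical toy data (not part of sw):
toy <- tibble(grp = c("a", "a", "b", "b", "b"),
              val = c(1, 3, 2, 4, 6))

# Grouped mutate: all rows are kept, each row gets the mean of its group
toy_m <- toy %>%
  group_by(grp) %>%
  mutate(grp_mean = mean(val)) %>%
  ungroup()

# Grouped summarise: the result contains one row per group
toy_s <- toy %>%
  group_by(grp) %>%
  summarise(n = n(),
            grp_mean = mean(val))

nrow(toy_m)  # 5 (as many rows as toy)
nrow(toy_s)  # 2 (as many rows as groups)
```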

Practice: Create some groups and compute descriptive statistics (n, mean, median, standard deviation, …) for some variables. For instance,

  • What is the number and mean height and mass of individuals from Tatooine by species and gender?

  • Which humans are more than 5cm taller than the average human overall?

  • Which humans are more than 5cm taller than the average human of their own gender?

Combining commands

The essential dplyr commands are quite simple by themselves, but form the basic verbs of a language for data manipulation. The commands become particularly powerful when they are combined into pipes (by using %>%). Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly.
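
To see what the pipe adds, it can help to write the same analysis once as nested function calls and once as a pipe. This is only a sketch on a hypothetical toy tibble (the names `toy`, `res_nested`, and `res_piped` are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data (for illustration only):
toy <- tibble(
  species = c("Human", "Human", "Droid", "Droid", "Wookiee"),
  height  = c(172, 188, 96, 167, 228)
)

# Nested calls have to be read from the inside out:
res_nested <- arrange(
  summarise(group_by(filter(toy, height > 100), species),
            mn_height = mean(height)),
  desc(mn_height)
)

# The same steps as a pipe can be read from top to bottom:
res_piped <- toy %>%
  filter(height > 100) %>%              # 1. select cases
  group_by(species) %>%                 # 2. define groups
  summarise(mn_height = mean(height)) %>%  # 3. aggregate by group
  arrange(desc(mn_height))              # 4. sort results

res_piped
```

Both versions should yield the same tibble; the pipe merely changes the order in which the steps are written down.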

Practice: Tidyverse meets universe

Answer the following questions about the dplyr::starwars dataset by using pipes of essential dplyr commands:

a. Basics:

  • Save the tibble dplyr::starwars as sw and report its dimensions.

b. Missing values and known unknowns:

  • How many missing (NA) values does sw contain?

  • Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?

c. Gender issues:

  • How many humans are contained in sw overall and by gender?

  • How many and which individuals in sw are neither male nor female?

  • For which species in sw are there at least 2 different gender values?

d. Popular homes and heights:

  • From which homeworld do the most individuals (rows) come?

  • What is the mean height of all individuals with orange eyes from the most popular homeworld?

e. Size and mass issues:

  • Compute the median, mean, and standard deviation of height for all droids.

  • Compute the mean height and median mass by species and save the result as h_m.

  • Sort h_m to list the 3 species with the smallest individuals (in terms of mean height).

  • Sort h_m to list the 3 species with the heaviest individuals (in terms of median mass).

f. Counting and arranging:

  • How many individuals exist of the three most frequent (known) species?

g. Grouped mutates:

  • Which individuals are more than 20% lighter than the average mass of individuals of their own homeworld?

# library(tidyverse)
# ?dplyr::starwars

## (a) Basic data properties: ---- 
sw <- dplyr::starwars
dim(sw)  # => 87 rows (denoting individuals) x 13 columns (variables) 
#> [1] 87 13

## (b) Missing data: ----- 

## (+) How many missing data points?
sum(is.na(sw))  # => 101 missing values.
#> [1] 101

# (+) Which individuals come from an unknown (missing) homeworld 
#     but have a known birth_year or mass? 
sw %>% 
  filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#>           name height  mass hair_color skin_color eye_color birth_year
#>          <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
#> 1         Yoda     66    17      white      green     brown        896
#> 2        IG-88    200   140       none      metal       red         15
#> 3 Qui-Gon Jinn    193    89      brown       fair      blue         92
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>


## (x) Which variable (column) has the most missing values?
colSums(is.na(sw))  # => birth_year has 44 missing values
#>       name     height       mass hair_color skin_color  eye_color 
#>          0          6         28          5          0          0 
#> birth_year     gender  homeworld    species      films   vehicles 
#>         44          3         10          5          0          0 
#>  starships 
#>          0
colMeans(is.na(sw)) #    (amounting to 50.6% of all cases). 
#>       name     height       mass hair_color skin_color  eye_color 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 
#> birth_year     gender  homeworld    species      films   vehicles 
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000 
#>  starships 
#> 0.00000000

## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald": 
# sw$hair_color[is.na(sw$hair_color)] <- "bald"


## (c) Gender issues: ----- 

# (+) How many humans are there of each gender?
sw %>% 
  filter(species == "Human") %>%
  group_by(gender) %>%
  count()
#> # A tibble: 2 x 2
#> # Groups:   gender [2]
#>   gender     n
#>    <chr> <int>
#> 1 female     9
#> 2   male    26

## Answer: 35 humans in total: 9 female and 26 male.

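# (+) How many and which individuals are neither male nor female?
# Note: filter() only keeps cases for which all conditions are TRUE, 
#       so cases with a missing (NA) gender are dropped here. 
sw %>% 
  filter(gender != "male", gender != "female")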
#> # A tibble: 3 x 13
#>                    name height  mass hair_color       skin_color eye_color
#>                   <chr>  <int> <dbl>      <chr>            <chr>     <chr>
#> 1 Jabba Desilijic Tiure    175  1358       <NA> green-tan, brown    orange
#> 2                 IG-88    200   140       none            metal       red
#> 3                   BB8     NA    NA       none             none     black
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> #   starships <list>

# (+) Of which species are there at least 2 different gender values?
sw %>%
  group_by(species, gender) %>%
  count() %>%  # table shows species by gender: 
  group_by(species) %>%  # Which species appear more than once in this table? 
  count() %>%
  filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups:   species [5]
#>    species    nn
#>      <chr> <int>
#> 1    Droid     2
#> 2    Human     2
#> 3 Kaminoan     2
#> 4  Twi'lek     2
#> 5     <NA>     2

## (d) Homeworld issues: ----- 

# (+) Popular homes: From which homeworld do the most individuals (rows) come? 
sw %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups:   homeworld [49]
#>    homeworld     n
#>        <chr> <int>
#>  1     Naboo    11
#>  2  Tatooine    10
#>  3      <NA>    10
#>  4  Alderaan     3
#>  5 Coruscant     3
#>  6    Kamino     3
#>  7  Corellia     2
#>  8  Kashyyyk     2
#>  9    Mirial     2
#> 10    Ryloth     2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)

# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld? 
sw %>% 
  filter(homeworld == "Naboo", eye_color == "orange") %>%
  summarise(n = n(),
            mn_height = mean(height))
#> # A tibble: 1 x 2
#>       n mn_height
#>   <int>     <dbl>
#> 1     3  208.6667

## Note: 
sw %>% filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#>                    name height  mass hair_color          skin_color
#>                   <chr>  <int> <dbl>      <chr>               <chr>
#> 1 Jabba Desilijic Tiure    175  1358       <NA>    green-tan, brown
#> 2                Ackbar    180    83       none        brown mottle
#> 3         Jar Jar Binks    196    66       none              orange
#> 4          Roos Tarpals    224    82       none                grey
#> 5            Rugor Nass    206    NA       none               green
#> 6               Sebulba    112    40       none           grey, red
#> 7        Ben Quadinaros    163    65       none grey, green, yellow
#> 8           Saesee Tiin    188    NA       none                pale
#> # ... with 8 more variables: eye_color <chr>, birth_year <dbl>,
#> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>


# (+) What are the mass and homeworld of the smallest droid?
sw %>% 
  filter(species == "Droid") %>%
  arrange(height)
#> # A tibble: 5 x 13
#>    name height  mass hair_color  skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl>      <chr>       <chr>     <chr>      <dbl>  <chr>
#> 1 R2-D2     96    32       <NA> white, blue       red         33   <NA>
#> 2 R5-D4     97    32       <NA>  white, red       red         NA   <NA>
#> 3 C-3PO    167    75       <NA>        gold    yellow        112   <NA>
#> 4 IG-88    200   140       none       metal       red         15   none
#> 5   BB8     NA    NA       none        none     black         NA   none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## (e) Size and mass: Group summaries: ----- 

# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
  filter(species == "Droid") %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            md_height = median(height, na.rm = TRUE),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#>       n not_NA_h md_height mn_height sd_height
#>   <int>    <int>     <dbl>     <dbl>     <dbl>
#> 1     5        4       132       140  52.00641

# (+) Compute the mean height and median mass by species and save the result as `h_m`:
h_m <- sw %>%
  group_by(species) %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            not_NA_m = sum(!is.na(mass)),
            md_mass = median(mass, na.rm = TRUE)
            )
h_m
#> # A tibble: 38 x 6
#>      species     n not_NA_h mn_height not_NA_m md_mass
#>        <chr> <int>    <int>     <dbl>    <int>   <dbl>
#>  1    Aleena     1        1   79.0000        1    15.0
#>  2  Besalisk     1        1  198.0000        1   102.0
#>  3    Cerean     1        1  198.0000        1    82.0
#>  4  Chagrian     1        1  196.0000        0      NA
#>  5  Clawdite     1        1  168.0000        1    55.0
#>  6     Droid     5        4  140.0000        4    53.5
#>  7       Dug     1        1  112.0000        1    40.0
#>  8      Ewok     1        1   88.0000        1    20.0
#>  9 Geonosian     1        1  183.0000        1    80.0
#> 10    Gungan     3        3  208.6667        2    74.0
#> # ... with 28 more rows

# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height):
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#>          species     n not_NA_h mn_height not_NA_m md_mass
#>            <chr> <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Yoda's species     1        1        66        1      17
#> 2         Aleena     1        1        79        1      15
#> 3           Ewok     1        1        88        1      20

# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass):
h_m %>% arrange(desc(md_mass)) %>%  slice(1:3)
#> # A tibble: 3 x 6
#>   species     n not_NA_h mn_height not_NA_m md_mass
#>     <chr> <int>    <int>     <dbl>    <int>   <dbl>
#> 1    Hutt     1        1       175        1    1358
#> 2 Kaleesh     1        1       216        1     159
#> 3 Wookiee     2        2       231        2     124


## (+) Other questions: ----- 

# (f) How many individuals come from the 3 most frequent (known) species?
sw %>%
  group_by(species) %>%
  count %>%
  arrange(desc(n)) %>%
  filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups:   species [9]
#>    species     n
#>      <chr> <int>
#> 1    Human    35
#> 2    Droid     5
#> 3     <NA>     5
#> 4   Gungan     3
#> 5 Kaminoan     2
#> 6 Mirialan     2
#> 7  Twi'lek     2
#> 8  Wookiee     2
#> 9   Zabrak     2
# => Human (35), Droid (5), and Gungan (3): 43 individuals in total

# (g) Which individuals are more than 20% lighter (in terms of mass) 
#     than the average mass of individuals of their own homeworld?
sw %>%
  select(name, homeworld, mass) %>%
  group_by(homeworld) %>%
  mutate(n_notNA_mass = sum(!is.na(mass)),  
         mn_mass = mean(mass, na.rm = TRUE),
         lighter = mass < (mn_mass - (.20 * mn_mass))
         ) %>%
  filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups:   homeworld [4]
#>            name homeworld  mass n_notNA_mass  mn_mass lighter
#>           <chr>     <chr> <dbl>        <int>    <dbl>   <lgl>
#> 1         R2-D2     Naboo    32            6 64.16667    TRUE
#> 2   Leia Organa  Alderaan    49            2 64.00000    TRUE
#> 3         R5-D4  Tatooine    32            8 85.37500    TRUE
#> 4          Yoda      <NA>    17            3 82.00000    TRUE
#> 5 Padmé Amidala     Naboo    45            6 64.16667    TRUE
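
Two of the solutions under (c) depend on how missing values are treated. As `filter` only keeps cases for which a condition is `TRUE`, comparing a missing (`NA`) value with `!=` drops the case. And the number of different values per group can also be counted directly with `n_distinct` (which counts `NA` as a value by default). The following sketch illustrates both points on a hypothetical toy tibble, so that it does not depend on the version of the starwars data:

```r
library(dplyr)

# Hypothetical toy data (for illustration only):
toy <- tibble(
  name    = c("A", "B", "C", "D", "E"),
  species = c("Human", "Human", "Droid", "Droid", "Droid"),
  gender  = c("male", "female", NA, "none", NA)
)

# (1) Comparisons with NA yield NA, so these cases are dropped:
only_known <- toy %>%
  filter(gender != "male", gender != "female")  # keeps only D

# %in% never returns NA, so cases with unknown gender are kept:
not_mf <- toy %>%
  filter(!(gender %in% c("male", "female")))    # keeps C, D, and E

# (2) Species with at least 2 different gender values:
n_gender <- toy %>%
  group_by(species) %>%
  summarise(n_gender = n_distinct(gender)) %>%  # NA counts as a value
  filter(n_gender > 1)

n_gender
```

Whether cases with an unknown gender should count as "neither male nor female" is a matter of interpretation; the sketch only shows how to obtain either result.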

More on data transformation

For more details on dplyr,

Visualizing data

Creating good graphs is both an art and a craft. A transparent visualization of data can promote insights before and beyond any mathematical analysis or statistical test. However, creating good graphs requires a thorough understanding of the data, the visual properties of graphs, and the tools that allow turning data into graphs. One such tool is the package ggplot2, which implements a so-called “grammar of graphics” for R.

In the following, we introduce some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.

See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and the links provided below for more detailed information.

Structure of ggplot commands

A generic template for creating a graph with ggplot is:

# Generic ggplot template: 
ggplot(data = <DATA>) + 
  <GEOM_fun>(mapping = aes(<MAPPING>), <arg_1 = val_1, ..., arg_n = val_n>) +
  <FACET_fun> +    # optional
  <LOOK_GOOD_fun>  # optional 
  
# Minimal ggplot template:
ggplot(<DATA>) + 
  <GEOM_fun>(aes(<MAPPING>))

The generic template includes the following parts:

  • <DATA> is a data frame or tibble that contains the data that is to be plotted.

  • <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)

  • A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
    1. in the aesthetic mapping (when varying visual features according to data properties), or
    2. by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
  • An optional <FACET_fun> splits a complex plot into multiple subplots.

  • A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding themes, plot titles and labels, color scales, and coordinate systems).
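
As a minimal sketch, the parts of the generic template could be filled in as follows. The choice of data (`ggplot2::mpg`), variables, and labels is only an example and not prescribed by the template:

```r
library(ggplot2)

p <- ggplot(data = mpg) +                      # <DATA>
  geom_point(mapping = aes(x = displ,          # <GEOM_fun> with <MAPPING>:
                           y = hwy,
                           color = drv),       # color varies with the data
             size = 2, alpha = 1/2) +          # constant arguments
  facet_wrap(~year) +                          # <FACET_fun> (optional)
  labs(title = "Fuel economy by engine displacement",  # <LOOK_GOOD_fun>s (optional)
       x = "Engine displacement (litres)",
       y = "Highway (miles per gallon)") +
  theme_bw()
p
```

Note that `color = drv` appears inside `aes()` (as it depends on the data), whereas `size` and `alpha` are set outside of `aes()` (as they remain constant).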

Basic plot types

ggplot2 contains dozens of geoms. Fortunately, many graphical messages can be conveyed by mastering a few basic plot types. In the following, we show some examples that illustrate the use of ggplot commands (and should get you started to explore other geoms).

1. Histograms

A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:

library(ggplot2)

# Data: ------ 
# Using mpg data:
?ggplot2::mpg
mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (A) Histogram: ------

# A minimal histogram:
hi1 <- ggplot(mpg, aes(x = cty)) +  # set mappings for ALL geoms
  geom_histogram(binwidth = 1) 
hi1


# The same histogram:
hi1b <- ggplot(mpg) +
  geom_histogram(aes(x = cty), binwidth = 1)  # set mappings for THIS geom
hi1b


# (B) Adding aesthetics, labels and themes: ------ 

# Enhanced version of the same plot: 
hi2 <- ggplot(mpg) +
  geom_histogram(aes(x = cty), binwidth = 1, fill = "forestgreen", color = "black") +
  labs(title = "Distribution of fuel economy in city environments", 
       x = "cty (miles per gallon)",
       caption = "Data from ggplot2::mpg") +
  theme_light()
hi2
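
The shape of a histogram depends on how the values are binned. If neither `binwidth` nor `bins` is specified, `geom_histogram` uses 30 bins (and prints a message). The following sketch compares both settings; it uses `layer_data()` to inspect the values that ggplot2 computed for a layer (the object names are made up for illustration):

```r
library(ggplot2)

# Default binning (bins = 30) vs. 1 bin per integer value of cty:
h_default <- ggplot(mpg, aes(x = cty)) + geom_histogram()
h_width_1 <- ggplot(mpg, aes(x = cty)) + geom_histogram(binwidth = 1)

# Inspect the computed bins of each plot:
d_default <- layer_data(h_default)
d_width_1 <- layer_data(h_width_1)

nrow(d_default)  # number of bins (default)
nrow(d_width_1)  # number of bins (binwidth = 1)

# Both histograms distribute the same cases over different bins:
sum(d_default$count)
sum(d_width_1$count)
```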

More on histograms:

2. Scatterplots

A scatterplot shows each data point (observation) at a position that is defined by its values on 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data:


# (A) Scatterplot: ------ 

# A minimal scatterplot + reference line:
sp1 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline()
sp1

Dealing with overplotting

A common issue with scatterplots is so-called overplotting: Multiple points appear at the same position and obscure each other.

Here are some ways of dealing with this issue:

  1. jitter adds randomness to positions;
  2. alpha uses transparency to show frequency of positions;
  3. the size aesthetic allows mapping values to object size (and geom_count maps the frequency of positions to point size);
  4. facet_wrap allows disentangling plots by levels of variables.

Some examples include:

## Dealing with overplotting: ----- 

# 1. One way of dealing with overplotting is 
#    adding randomness to point positions:  
sp2 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "jitter") +
  geom_abline()
sp2


# 2. Another way of dealing with overplotting is 
#    using transparency (via setting alpha to < 1): 
sp3 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "identity", 
             pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
  geom_abline(linetype = 2, color = "firebrick") # + 
  # geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
sp3


# Adding labels and themes to plots: 
sp4 <- sp3 +   # use the plot defined above
  labs(title = "Fuel economy on highway vs. city",
                x = "City (miles per gallon)",
                y = "Highway (miles per gallon)",
                caption = "Data from ggplot2::mpg") +
  # coord_fixed() +
  theme_bw()
sp4


# (C) Grouping (by a categorical variable): ------  

# Using facets to avoid overplotting: 
sp5 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline() + 
  facet_wrap(~class) +
  theme_bw()
sp5


# Grouping by color:
sp6 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = class), 
             position = "jitter", alpha = 1/2, size = 4) +
  geom_abline(linetype = 2) +
  theme_bw()
sp6


# Grouping by facets: 
sp7 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), 
             position = "jitter", alpha = 1/2, size = 2) +
  geom_abline(linetype = 2) +
  facet_wrap(~class) +
  theme_bw()
sp7
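
The examples above use jitter, transparency, and facets. Mapping the frequency of positions to point size can be sketched with `geom_count`, which counts the number of cases at each position (the object name `sp_count` is made up for illustration):

```r
library(ggplot2)

# Point size shows the number of cases at each position:
sp_count <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count(color = "steelblue", alpha = 1/2) +
  geom_abline(linetype = 2) +
  labs(size = "Frequency:") +
  theme_bw()
sp_count
```

Unlike jittering, this keeps all points at their true positions, but small differences in point size can be hard to judge.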

Note some details:

  • ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.

  • The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.

  • Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).

  • When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).

  • In ggplot, a sequence of commands is combined by +, rather than %>%.

  • The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots).

More on scatterplots:

3. Bar plots

Another common type of plot shows the values (across different levels of some variable) as the height of bars. As this plot type can use both categorical and continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:

Counts of cases

By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):

library(ggplot2)

## Data: 
ggplot2::mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (a) Count number of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class))


# (b) is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..))


# (c) is the same as:
ggplot(mpg) + 
  geom_bar(aes(x = class), stat = "count")


# (d) is the same as:
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..), stat = "count")


# (e) pimped version:
ggplot(mpg) + 
  geom_bar(aes(x = class, fill = class), 
           # stat = "count", 
           color = "black") + 
  labs(title = "Counts of cars by class",
       x = "Class of car", y = "Frequency") + 
  scale_fill_brewer(name = "Class:", palette = "Blues") + 
  theme_bw()

Practice: Plot the number or frequency of cases in the mpg data by cyl (in at least 3 different ways).

Proportion of cases

An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:

library(ggplot2)

## Data: 
ggplot2::mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (1) Proportion of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..prop.., group = 1))


# is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count../sum(..count..)))

Practice: Plot the proportion of cases in the mpg data by cyl (in at least 3 different ways).
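
The `..count..` and `..prop..` notation used above stems from earlier versions of ggplot2. More recent versions (3.3.0 and later) provide `after_stat()` for the same purpose, and the dot-dot notation was deprecated in version 3.4.0. A minimal sketch, assuming a sufficiently recent version of ggplot2 is installed (the object names are made up for illustration):

```r
library(ggplot2)  # assumes ggplot2 version 3.3.0 or later

# Counts of cases by class:
bar_n <- ggplot(mpg) + 
  geom_bar(aes(x = class, y = after_stat(count)))
bar_n

# Proportions of cases by class:
bar_p <- ggplot(mpg) + 
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))
bar_p
```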

Bar plots of existing values

A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").

For instance, let’s plot a bar chart that shows the election data from the following tibble de:

year party share
2013 CDU/CSU 0.415
2013 SPD 0.257
2013 Others 0.328
2017 CDU/CSU 0.330
2017 SPD 0.205
2017 Others 0.465
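
The code that created the tibble de is not shown here. The following is a sketch of how it could be defined from the values in the table. It assumes that party is a factor with levels in the order CDU/CSU, SPD, Others (this order is an assumption, chosen to match the order of rows). The sketch also uses `geom_col`, which is a shortcut for `geom_bar(stat = "identity")`:

```r
library(tibble)
library(ggplot2)

# Election data (values copied from the table above):
de <- tibble(
  year  = rep(c("2013", "2017"), each = 3),
  party = factor(rep(c("CDU/CSU", "SPD", "Others"), times = 2),
                 levels = c("CDU/CSU", "SPD", "Others")),  # assumed order of levels
  share = c(.415, .257, .328, 
            .330, .205, .465)
)
de

# geom_col() plots existing values (without computing any counts):
bp_col <- ggplot(de, aes(x = year, y = share, fill = party)) +
  geom_col(position = "dodge")
bp_col
```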
  1. A version with 2 x 3 separate bars (using position = "dodge"):
## Data: ----- 
de  # => 6 x 3 tibble
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)

## (1) Bar chart with  side-by-side bars (dodge): ----- 

## (a) minimal version: 
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (A) 3 bars per election (position = "dodge"):  
  geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1


## (b) Version with text labels and customized colors: 
bp_1 + 
  ## pimping plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .01), 
            position = position_dodge(width = 1), 
            fontface = 2, color = "black") + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_bw()

  2. A version with 2 bars with 3 segments (using position = "stack"):
## Data: ----- 
de  # => 6 x 3 tibble
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## (2) Bar chart with stacked bars: -----  

## (a) minimal version: 
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (B) 1 bar per election (position = "stack"):
  geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2


## (b) Version with text labels and customized colors: 
bp_2 +   
  ## Pimping plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%")), 
            position = position_stack(vjust = .5),
            color = rep(c("black", "white", "white"), 2), 
            fontface = 2) + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_classic()

Bar plots with error bars

It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, confidence interval, etc.) to any bar plot. There is an entire range of geoms that draw error bars:

## Create data to plot: ----- 
n_cat <- 6
set.seed(101)

data <- tibble(
  name = LETTERS[1:n_cat],
  value = sample(seq(25, 50), n_cat),
  sd = rnorm(n = n_cat, mean = 0, sd = 8))  # random values, for illustration only (can be negative)
data
#> # A tibble: 6 x 3
#>   name  value     sd
#>   <chr> <int>  <dbl>
#> 1 A        34  1.71 
#> 2 B        26  2.49 
#> 3 C        42  9.39 
#> 4 D        40  4.95 
#> 5 E        30 -0.902
#> 6 F        31  7.34

## Error bars: -----

## x-aesthetic only:

# (a) errorbar: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
    geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd), 
                  width = 0.4, color = "orange", alpha = 1, size = 1.0)


# (b) linerange: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
    geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd), 
                   color = "firebrick", alpha = 1, size = 2.5)


## Additional y-aesthetic: 

# (c) crossbar:
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "tomato4") +
    geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                  width = 0.3, color = "sienna1", alpha = 1, size = 1.0)


# (d) pointrange: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "burlywood4") +
    geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                    color = "gold", alpha = 1.0, size = 1.2)
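
The data plotted above was randomly generated. With real data, the values shown by bars and error bars are typically computed first (e.g., by a grouped `summarise`). A minimal sketch, assuming that we want to show the mean of `hwy` and its standard error for each class of cars in `ggplot2::mpg` (all object and variable names are made up for illustration):

```r
library(dplyr)
library(ggplot2)

# Compute descriptive statistics by group:
hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(n = n(),
            mn_hwy = mean(hwy),
            sd_hwy = sd(hwy),
            se_hwy = sd_hwy / sqrt(n))  # standard error of the mean
hwy_by_class

# Plot means as bars and mean +/- 1 SE as error bars:
bp_se <- ggplot(hwy_by_class, aes(x = class, y = mn_hwy)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = mn_hwy - se_hwy, ymax = mn_hwy + se_hwy), 
                width = 0.4) +
  labs(title = "Mean fuel economy on highway by class",
       x = "Class of car", y = "Mean hwy (miles per gallon)",
       caption = "Error bars show +/- 1 standard error.") +
  theme_bw()
bp_se
```

Stating in a caption what the error bars represent is advisable, as standard deviations, standard errors, and confidence intervals can differ considerably.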

More on barplots:

4. Lines and curves

There are many types of lines. In this section, we introduce some basic types.

  1. Straight and curved lines: When using lines to illustrate boundaries, limits, or trends in plots, we can add them by specifying their key parameters (e.g., their intercept, slope, etc.).
# Draw some basic lines:

# (_) Draw empty plot canvas: 
ggplot()


# (a) Draw basic lines (by linear equation):
ggplot() +
  geom_abline(linetype = 2, color = "forestgreen") +  # dashed diagonal (y = x)
  geom_abline(intercept = 1/3, slope = 1/3)           # y = .333 + .333 x 

# Note the absence of labels on axes!

# (b) Add horizontal line: 
ggplot() +
  geom_abline(linetype = 2, color = "forestgreen") + 
  geom_abline(intercept = 1/3, slope = 1/3) + 
  geom_hline(yintercept = .50, color = "firebrick")   # horizontal line

# Note: Labels on y-axis are added automatically. 

# (c) Add vertical line: 
ggplot() +
  geom_abline(linetype = 2, color = "forestgreen") + 
  geom_abline(intercept = 1/3, slope = 1/3) + 
  geom_hline(yintercept = .50, color = "firebrick") + 
  geom_vline(xintercept = .75, color = "steelblue")   # vertical line

# Note: Labels on x-axis are added automatically.   

# (d) Add line segments (with start and end points):
ggplot() +
  geom_abline(linetype = 2, color = "forestgreen") + 
  geom_abline(intercept = 1/3, slope = 1/3) + 
  geom_hline(yintercept = .50, color = "firebrick") + 
  geom_vline(xintercept = .75, color = "steelblue") +
  geom_segment(aes(x = 1/4, y = 1, xend = 1, yend = 1/4), 
               color = "gold", arrow = NULL)          # line segment 

# Note: To draw arrows, replace NULL by an arrow specification like 
# arrow(angle = 30, length = unit(0.5, "cm"), ends = "first", type = "closed")

# (e) Add curve (with start and end points):
ggplot() +
  geom_abline(linetype = 2, color = "forestgreen") + 
  geom_abline(intercept = 1/3, slope = 1/3) + 
  geom_hline(yintercept = .50, color = "firebrick") + 
  geom_vline(xintercept = .75, color = "steelblue") +
  geom_segment(aes(x = 1/4, y = 1, xend = 1, yend = 1/4), 
               color = "gold", arrow = NULL) + 
  geom_curve(aes(x = 1/3, y = 2/3, xend = 1, yend = 1/3), 
               color = "orange", curvature = -.6)      # curve


# (+) Prettify plot:
ggplot() +
  geom_abline(linetype = 2, color = "forestgreen") + 
  geom_abline(intercept = 1/3, slope = 1/3) + 
  geom_hline(yintercept = .50, color = "firebrick") + 
  geom_vline(xintercept = .75, color = "steelblue") +
  geom_segment(aes(x = 1/4, y = 1, xend = 1, yend = 1/4), 
               color = "gold", arrow = NULL) + 
  geom_curve(aes(x = 1/3, y = 2/3, xend = 1, yend = 1/3), 
               color = "orange", curvature = -.6) + 
  labs(title = "Plotting basic lines", 
       x = "x-value", y = "y-value", 
       caption = "[ds4psy]") + 
  theme_bw()

  2. Drawing functions: A more general approach to drawing lines is using functions that define the value of y as a computation on some value x:
## Drawing functions:

# (a) Define some functions:
fn0 <- function(x){x}
fn1 <- function(x){1/3 * x + 1/3} 
fn2 <- function(x){x^2 - x}
fn3 <- function(x){-log(abs(x))}
fn4 <- function(x){2^x}
fn5 <- function(x){2 * sin(x)}
fn6 <- function(x){rnorm(x, mean = 0, sd = 1)}     # random value from normal dist. 
fn7 <- function(x){rbinom(x, size = 1, prob = .5)} # random value from binom. dist.

# (b) Empty plotting canvas:
ggplot(data.frame(x = c(-10, 10)), aes(x = x))   # empty canvas from -10 < x < +10


# (c) Draw functions with stat_function(): 
ggplot(data.frame(x = c(-10, 10)), aes(x = x)) + 
    stat_function(fun = fn0, color = "black") + 
    stat_function(fun = fn1, color = "steelblue") + 
    stat_function(fun = fn2, color = "forestgreen") +   
    stat_function(fun = fn3, color = "firebrick") + 
    stat_function(fun = fn4, color = "gold") + 
    stat_function(fun = fn5, color = "orange") + 
    stat_function(fun = fn6, color = "grey50") + 
    stat_function(fun = fn7, color = "grey75") + 
    ## Prettify plot: ## 
    labs(title = "Plotting different functions", 
         caption = "[ds4psy]") + 
    coord_cartesian(xlim = c(-3, +3), ylim = c(-3, +3)) +  # zoom in on plot region
    theme_bw()                                             # use bw theme
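
As a rough sketch of what stat_function() does: It evaluates the function at a number of evenly spaced x-values (101 by default, set by its argument n) and connects the resulting points by a line. The following code reproduces this by hand for fn2. The names xs and tb_fn are chosen here for illustration only:

```r
library(tidyverse)

fn2 <- function(x){x^2 - x}   # same definition as above

# Evaluate fn2 on a grid of 101 evenly spaced x-values:
xs <- seq(-3, 3, length.out = 101)
tb_fn <- tibble(x = xs, y = fn2(xs))

# Connecting these points yields (approximately) the same curve:
ggplot(tb_fn, aes(x = x, y = y)) +
  geom_line(color = "forestgreen") +
  theme_bw()
```

This also suggests why the random functions fn6 and fn7 appear as jagged lines: A new random value is drawn for every x-value of the grid.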

  1. Line plots of data: When we have grouped data (e.g., some values measured repeatedly over time) it often makes sense to show their development as a line plot. For instance, imagine having taken the following measurements of 3 people over the days of 1 week:
name Mon Tue Wed Thu Fri Sat Sun
Adam 2.5 3.6 3.8 4.2 4.4 2.8 3.2
Beta 3.3 2.9 3.0 2.1 2.3 2.5 3.9
Civo 4.2 4.8 4.0 3.1 3.9 3.7 2.1

We can easily define this data as a tibble (e.g., row-by-row, using the tribble command), but then encounter a problem: To use geom_line we need to define a mapping from some variable x to some variable y. However, we do not have an individual variable x here, but rather 7 columns (1 for each day of the week) that contain the measured values for every person. To obtain 1 variable that contains all days (to be mapped to x) and 1 variable that contains all measured values (to be mapped to y), we need to re-format the data from wide to long format (see Chapter 12: Tidy data, which introduces the tidyr package).

# (a) Data tibble (in wide format):
tb <- tribble(
  ~name, ~Mon, ~Tue, ~Wed, ~Thu, ~Fri, ~Sat, ~Sun,   
  #-----|-----|-----|-----|-----|-----|-----|-----| 
  "Adam", 2.5, 3.6,  3.8,  4.2,  4.4,  2.8,  3.2,          
  "Beta", 3.3, 2.9,  3.0,  2.1,  2.3,  2.5,  3.9,
  "Civo", 4.2, 4.8,  4.0,  3.1,  3.9,  3.7,  2.1      
)

tb  # print data (in wide format):
#> # A tibble: 3 x 8
#>    name   Mon   Tue   Wed   Thu   Fri   Sat   Sun
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  Adam   2.5   3.6   3.8   4.2   4.4   2.8   3.2
#> 2  Beta   3.3   2.9   3.0   2.1   2.3   2.5   3.9
#> 3  Civo   4.2   4.8   4.0   3.1   3.9   3.7   2.1

# (b) Re-format from wide to long format (using tidyr commands):
tb_long <- tb %>%
  gather(Mon:Sun, key = "day", value = "val") %>%
  arrange(name)

tb_long  # print data (in long format):
#> # A tibble: 21 x 3
#>     name   day   val
#>    <chr> <chr> <dbl>
#>  1  Adam   Mon   2.5
#>  2  Adam   Tue   3.6
#>  3  Adam   Wed   3.8
#>  4  Adam   Thu   4.2
#>  5  Adam   Fri   4.4
#>  6  Adam   Sat   2.8
#>  7  Adam   Sun   3.2
#>  8  Beta   Mon   3.3
#>  9  Beta   Tue   2.9
#> 10  Beta   Wed   3.0
#> # ... with 11 more rows

# (c) Line plot of tb_long: 
ggplot(tb_long, aes(x = day, y = val, group = name, color = name)) +
  geom_line(size = 1.0)

# However, note that x-axis labels are ordered alphabetically! 
# The reason for this is that -- in tb_long -- day is a character variable. 
# To fix this, we need to turn day into a factor with levels that match its values: 

Note that the labels on the x-axis are ordered alphabetically (i.e., from “Fri” to “Wed”). The reason for this is that – in tb_long – the variable day is of type character.
To fix this problem, we need to turn day into a factor with levels that match its values:

# (d) Turn day into a factor:
tb_long$day <- factor(tb_long$day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# tb_long

# (e) Repeat (c) with tb_long$day as factor:
ggplot(tb_long, aes(x = day, y = val, group = name, color = name)) +
  geom_line(size = 1.0)

# Note that the order of weekdays is now correct (corresponding to factor levels). 

# (f) A prettier version of the same plot:
ggplot(tb_long, aes(x = day, y = val, group = name, color = name, shape = name)) +
  geom_line(size = 1.0) +
  geom_point(size = 2.5) +
  labs(title = "Line plot of data", 
       x = "Day of week", y = "Measurement", 
       caption = "[ds4psy]") + 
  scale_color_brewer(palette = "Set1") + 
  theme_bw()
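
The effect of turning day into a factor can also be checked without plotting. The following sketch uses only base R and the hypothetical names day_chr and day_fct to contrast the alphabetical order of character values with the order defined by factor levels:

```r
day_chr <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

sort(day_chr)     # character values are sorted alphabetically

day_fct <- factor(day_chr, levels = day_chr)
levels(day_fct)   # factor levels keep the order we specified
sort(day_fct)     # sorting now follows the order of levels
```

ggplot2 arranges a discrete axis in the same way: alphabetically for character variables, and by the order of levels for factors.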

5. Box plots

ToDo:

  • show medians, quartiles, distribution, and outliers
  • also show means
  • also show raw data
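
Until this section is completed, here is a minimal sketch of a box plot that addresses the 3 points of this list. It is only one of many possible solutions and assumes the iris data (which is also used for the scatterplots below) and a version of ggplot2 in which stat_summary accepts the argument fun (version 3.3.0 or later):

```r
library(tidyverse)

ir <- as_tibble(iris)

# Medians and means per group (to compare with the plot):
ir_sum <- ir %>%
  group_by(Species) %>%
  summarise(md = median(Petal.Length),
            mn = mean(Petal.Length))
ir_sum

ggplot(ir, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(outlier.color = "firebrick") +   # medians, quartiles, and outliers
  geom_jitter(width = .1, alpha = 1/3) +        # raw data (jittered horizontally)
  stat_summary(fun = mean, geom = "point",      # means (shown as crosses)
               shape = 4, size = 3, color = "steelblue") +
  labs(title = "Box plot", x = "Species", y = "Length of petal",
       caption = "[Using iris data.]") +
  theme_bw()
```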

Improving plots

Most default plots can be improved by fine-tuning their visual appearance. Popular levers for “pimping” plots include:

  • colors, shapes and sizes can be set within geoms (and are variable when inside aes(...), or fixed when set outside). Using color often involves setting specific color scales;
  • labels are essential for understanding plots: labs(...) allows setting titles, captions, axis labels, etc.;
  • legends can be crucial for understanding aesthetic mappings. They can be edited or (re-)moved;
  • themes allow for a consistent look, can be selected and modified.
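
The difference between setting an aesthetic inside vs. outside of aes(...) is easiest to see in a direct comparison. The following sketch assumes the iris data (as a tibble ir):

```r
library(tidyverse)

ir <- as_tibble(iris)

# (a) color inside aes(): variable (mapped to the values of Species):
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species))

# (b) color outside aes(): fixed (the same color for all points):
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width), color = "steelblue")
```

Only version (a) creates a legend, as only a mapping from a variable to an aesthetic needs to be explained.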

Colors

  • Colors of lines, points, shapes, and text
  • Choosing color scales

Symbols

Setting the shape parameter to different pch types.

# Create an empty chart (in base R): 
plot(1, 1, xlim = c(1, 5.5), ylim = c(0, 7), type = "n", ann = FALSE)

# Plot digits 0-4 with increasing size and color
text(1:5, rep(6, 5), labels = c(0:4), cex = 1:5, col = 1:5)

# Plot symbols 0-4 with increasing size and color
points(1:5, rep(5,5), cex = 1:5, col = 1:5, pch = 0:4)
text((1:5)+0.4, rep(5,5), cex = 0.6, (0:4))

# Plot symbols 5-9 with labels
points(1:5, rep(4,5), cex = 2, pch = (5:9))
text((1:5)+0.4, rep(4,5), cex = 0.6, (5:9))

# Plot symbols 10-14 with labels
points(1:5, rep(3,5), cex = 2, pch = (10:14))
text((1:5)+0.4, rep(3,5), cex = 0.6, (10:14))

# Plot symbols 15-19 with labels
points(1:5, rep(2,5), cex = 2, pch = (15:19))
text((1:5)+0.4, rep(2,5), cex = 0.6, (15:19))

# Plot symbols 20-25 with labels
points((1:6)*0.8+0.2, rep(1, 6), cex = 2, pch = (20:25))
text((1:6)*0.8+0.5, rep(1, 6), cex = 0.6, (20:25))
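
The same symbols are available as values of the shape aesthetic in ggplot2. The following sketch draws all 26 symbols in a grid. The tibble tb_shapes and its variables are hypothetical helpers that are only created for this illustration:

```r
library(tidyverse)

# Helper tibble: 1 row per shape code (0 to 25), arranged in rows of 5:
tb_shapes <- tibble(shape = 0:25,
                    x = shape %% 5,
                    y = -(shape %/% 5))

ggplot(tb_shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 4, fill = "gold") +  # fill only affects shapes 21-25
  geom_text(aes(label = shape), nudge_y = -.3, size = 3) +   # label each symbol with its code
  scale_shape_identity() +                                   # use values of shape as shape codes
  theme_void()
```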

Canvas and annotations

  • Titles, axis labels, etc.
  • Legends
  • Grid lines
  • Themes

More on data visualization

Data exploration

This section summarizes some essential parts of Chapter 7: Exploratory data analysis (EDA).

Defining EDA

In the introduction to data visualization, we mentioned that creating good graphs is both an art and a craft. This implies that a recipe for creating good graphs involves three ingredients:

  1. a solid understanding of the data involved,
  2. the right set of tools to deal with data, and
  3. lots of dedicated practice in using these tools to solve concrete tasks.

This recipe can be extended beyond graphs, as a mixture of the same ingredients is needed for all aspects of data analysis. For instance, when obtaining and exploring a new dataset, it is both an art and a craft to quickly obtain a good understanding of its contents. Exploratory data analysis (EDA) is the process of getting a grasp of new data. Efficient and effective EDA requires combining commands on tibbles (tibble), data visualization (ggplot2), and data transformation (dplyr).

Basic questions

Getting a grasp of some data requires understanding two inter-related aspects:

  1. Semantics: What is the meaning and functional role of the observations?

    • What are the units of analysis (cases or observations)?
    • What variables exist for each case/observation (e.g., multiple measures for each case)?
    • What are relationships between observations (e.g., repeated measurements) or variables (e.g., correlations)?
    • What are independent vs. dependent variables (of an experiment)?
  2. Formats: What data types are contained in the data and how are they arranged?

    • How is the data formatted (in rows vs. columns)?
    • What types of variables (columns) exist?
    • Is the data tidy? (See the definition in Chapter 12: Tidy data.)

Answering all these questions is often difficult and requires many small steps that analyze and transform a dataset.
In the following, we will illustrate the most common steps.

Typical steps

Here are some basic questions to answer whenever we get (load or create) a new data file:

  • What are the dimensions of the data?
  • What types of variables (columns) are involved?
  • What are the cases or observations (rows)?
  • What are the ranges, distributions, and unexpected values (e.g., missing data and outliers) of variables (columns)?
  • What are the relationships between variables?
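
Each of these questions corresponds to 1 or 2 simple commands. The following sketch shows one possible set of commands, using the iris data as an example:

```r
library(tidyverse)

ir <- as_tibble(iris)   # example data

dim(ir)                 # dimensions (rows x columns)
glimpse(ir)             # types of variables (columns) and their first values
ir                      # first cases/observations (rows)
summary(ir)             # ranges and basic distribution of each variable
colSums(is.na(ir))      # number of missing values per variable

cor(ir$Petal.Length, ir$Petal.Width)  # relationship between 2 variables
```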

Dealing with missing data and outliers

ToDo

Plotting distributions and relations

Creating good graphs is both an art and a craft, but good graphs also provide a quick overview of an unknown set of data. The key to creating good graphs lies in answering 2 sets of questions:

  1. Knowing the number and type of variables to be plotted. This includes answering data-related questions like

    • How many variables are there to plot?
    • Are these variables categorical or continuous?
    • Do some variables qualify (e.g., group) the values of others?
  2. Knowing the intended type of plot. This includes answering functional questions like

    • What is the purpose of this plot?
    • What are possible plots for this purpose?
    • Which of these would be the most appropriate plot?

Even when the questions of 1. and 2. are answered, creating good graphs with ggplot requires a lot of practice and many hours of trial-and-error experimentation.

Histograms

A histogram shows counts of the values of 1 (typically continuous) variable. This is useful for evaluating the distribution of the variable:

library(tidyverse)  # loads ggplot2 and tibble (needed for the tibble command)
 
# Create data: 
tb <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 15))
 
# Basic histogram:
ggplot(tb) + 
  geom_histogram(aes(x = iq), binwidth = 5)


# Pimped histogram: 
ggplot(tb) + 
  geom_histogram(aes(x = iq), binwidth = 5, 
                 fill = "gold", color = "black") +
  labs(title = "Histogram", x = "IQ values", y = "Frequency in sample (n)",
       caption = "[Using random iq data.]") +
  theme_classic()

More on histograms:

Scatterplots

A scatterplot shows the relationship between 2 (typically continuous) variables:

# Data:
ir <- as_tibble(iris)
ir
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1         5.10        3.50         1.40       0.200 setosa 
#>  2         4.90        3.00         1.40       0.200 setosa 
#>  3         4.70        3.20         1.30       0.200 setosa 
#>  4         4.60        3.10         1.50       0.200 setosa 
#>  5         5.00        3.60         1.40       0.200 setosa 
#>  6         5.40        3.90         1.70       0.400 setosa 
#>  7         4.60        3.40         1.40       0.300 setosa 
#>  8         5.00        3.40         1.50       0.200 setosa 
#>  9         4.40        2.90         1.40       0.200 setosa 
#> 10         4.90        3.10         1.50       0.100 setosa 
#> # ... with 140 more rows

# Basic scatterplot:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))


# Using 3 different facets:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  facet_wrap(~Species)


# Pimped scatterplot:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, fill = Species), pch = 21, color = "black", size = 2, alpha = 1/2) +
  facet_wrap(~Species) +
  # coord_fixed() + 
  labs(title = "Scatterplot", x = "Length of petal", y = "Width of petal",
       caption = "[Using iris data.]") + 
  theme_bw() +
  theme(legend.position = "none")

More on scatterplots:

(…)

Tidy data

Chapter 12: Tidy data teaches a consistent way to organise tabular data. It introduces commands of the tidyr package, which is a core member of the tidyverse.

Tabular data

In R, rectangular data is often organized in tibbles or data frames. Importantly, each column is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of one type, every row can contain values of different variables and types.

The same set of data (values of variables) can be organised in many different ways. For instance, the following tables (or tibbles) all provide the number of TB cases documented by the World Health Organization in 3 countries (Afghanistan, Brazil, and China) in 2 years (1999 and 2000):

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

library(tidyverse)

## Example of the same data organised in 4 different ways:
# ?table1 # for semantics and source of data

tidyr::table1
#> # A tibble: 6 x 4
#>       country  year  cases population
#>         <chr> <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

tidyr::table2
#> # A tibble: 12 x 4
#>        country  year       type      count
#>          <chr> <int>      <chr>      <int>
#>  1 Afghanistan  1999      cases        745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000      cases       2666
#>  4 Afghanistan  2000 population   20595360
#>  5      Brazil  1999      cases      37737
#>  6      Brazil  1999 population  172006362
#>  7      Brazil  2000      cases      80488
#>  8      Brazil  2000 population  174504898
#>  9       China  1999      cases     212258
#> 10       China  1999 population 1272915272
#> 11       China  2000      cases     213766
#> 12       China  2000 population 1280428583

tidyr::table3
#> # A tibble: 6 x 3
#>       country  year              rate
#> *       <chr> <int>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

tidyr::table4a
#> # A tibble: 3 x 3
#>       country `1999` `2000`
#> *       <chr>  <int>  <int>
#> 1 Afghanistan    745   2666
#> 2      Brazil  37737  80488
#> 3       China 212258 213766
tidyr::table4b
#> # A tibble: 3 x 3
#>       country     `1999`     `2000`
#> *       <chr>      <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2      Brazil  172006362  174504898
#> 3       China 1272915272 1280428583

tidyr::table5
#> # A tibble: 6 x 4
#>       country century  year              rate
#> *       <chr>   <chr> <chr>             <chr>
#> 1 Afghanistan      19    99      745/19987071
#> 2 Afghanistan      20    00     2666/20595360
#> 3      Brazil      19    99   37737/172006362
#> 4      Brazil      20    00   80488/174504898
#> 5       China      19    99 212258/1272915272
#> 6       China      20    00 213766/1280428583

Practice: Recreate the above bar plot using ggplot2 with data = table1.

Defining tidy data

Definition: A tidy dataset conforms to 3 interrelated rules:

  1. Each variable must have its own column.

  2. Each case/observation must have its own row.

  3. Each value must have its own cell.

See http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.

The 3 rules defining tidy data are connected, as it is impossible to only satisfy 2 of the 3. This leads to a simpler set of practical instructions for tidying a messy set of data:

  1. turn each dataset into a tibble.
  2. put each variable into a column.

Note that we need to interpret the semantics of the variables to understand whether a data set is tidy.

Practice: Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?

Advantages of tidy data

  1. Consistency: Consistent data structures make it easier to learn the tools that work with it because they have an underlying uniformity.

  2. Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic verbs of dplyr (and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.

  3. Matching data and tools: Packages like dplyr, ggplot2, and many others are designed to work with tidy data.
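
As a brief illustration of these advantages, the following sketch uses the tidy table1 to compute a new variable and a summary with basic dplyr commands. The name t1_rate is chosen here for illustration:

```r
library(tidyverse)

# Compute the rate of cases per 10,000 people (as a new variable rate):
t1_rate <- table1 %>%
  mutate(rate = cases / population * 10000)
t1_rate

# Compute the sum of cases per year:
table1 %>%
  count(year, wt = cases)
```

Both commands work on entire columns (vectors) at once, which is only possible because each variable is in its own column.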

Commands and examples

We consider 2 pairs of 2 complementary commands as essential:

  1. separate splits 1 variable into 2 variables;
  2. unite combines 2 variables into 1 variable;
  3. gather makes wide data longer (by gathering many variables into 1);
  4. spread makes long data wider (by spreading 1 variable into many).

separate is the complement/opposite of unite and spread is the complement/opposite of gather.

Here are some basic examples for using these 4 commands:

1. separate a variable

separate splits 1 variable (column) into multiple variables (columns) – at a position where some separator character appears – and is the complement to unite. Using separate requires the following arguments:

  • some tibble/data frame data;
  • the variable (column) col to be separated (specified by its name or column number);
  • the names of the new variables (columns) into which col is to be split (specified as a character vector);
  • the separator character sep (as a character/regular expression).

An additional argument remove regulates whether the original column is dropped from the output tibble. By default, remove = TRUE.

# Data to use: 
tidyr::table3  # Note that column rate contains 2 numbers, separated by "/". 
#> # A tibble: 6 x 3
#>       country  year              rate
#> *       <chr> <int>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

## Basics: ----- 

# Full separate command:
separate(data = table3, col = rate, into = c("cases", "population"), sep = "/")
#> # A tibble: 6 x 4
#>       country  year  cases population
#> *       <chr> <int>  <chr>      <chr>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583
# Note that "/" disappears from output tibble.

# Shorter versions of the same command:
separate(table3, rate, c("cases", "population"))
#> # A tibble: 6 x 4
#>       country  year  cases population
#> *       <chr> <int>  <chr>      <chr>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

# Using the pipe: 
table3 %>% 
  separate(rate, c("cases", "population"))
#> # A tibble: 6 x 4
#>       country  year  cases population
#> *       <chr> <int>  <chr>      <chr>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

## Variants: ----- 

# Specifying the variable to be split (rate) by its column number (3):
table3 %>% 
  separate(3, c("cases", "population"))
#> # A tibble: 6 x 4
#>       country  year  cases population
#> *       <chr> <int>  <chr>      <chr>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3      Brazil  1999  37737  172006362
#> 4      Brazil  2000  80488  174504898
#> 5       China  1999 212258 1272915272
#> 6       China  2000 213766 1280428583

# Not dropping the original rate variable:
table3 %>% 
  separate(rate, c("cases", "population"), remove = FALSE)
#> # A tibble: 6 x 5
#>       country  year              rate  cases population
#> *       <chr> <int>             <chr>  <chr>      <chr>
#> 1 Afghanistan  1999      745/19987071    745   19987071
#> 2 Afghanistan  2000     2666/20595360   2666   20595360
#> 3      Brazil  1999   37737/172006362  37737  172006362
#> 4      Brazil  2000   80488/174504898  80488  174504898
#> 5       China  1999 212258/1272915272 212258 1272915272
#> 6       China  2000 213766/1280428583 213766 1280428583

The examples show that the argument names (data, col, and into) can be left out (but the arguments must then be provided in the correct order) and that sep can be left unspecified: By default, separate splits values wherever it finds a character that is neither a letter nor a number (like the "/" in rate).

However, consider the following table6, which is available online and can be read into R via read_csv("http://rpository.com/ds4psy/data/table6.csv"):

## Load data (as comma-separated file): 
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv")  # from online source

## Alternatively (from local source "data/table6.csv"): 
# table6 <- read_csv("data/table6.csv")  # from local directory

table6
#> # A tibble: 6 x 2
#>       country               when_what
#>         <chr>                   <chr>
#> 1 Afghanistan      19_99.745/19987071
#> 2 Afghanistan     20_00.2666/20595360
#> 3      Brazil   19_99.37737/172006362
#> 4      Brazil   20_00.80488/174504898
#> 5       China 19_99.212258/1272915272
#> 6       China 20_00.213766/1280428583

Here, the variable when_what contains several plausible separation characters: _, ., and /. Let’s first see what happens when we fail to provide a separating character sep, and then split the variable when_what in three different ways:

# Data to use: 
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
table6    # Note that column when_what contains several splitting options. 
#> # A tibble: 6 x 2
#>       country               when_what
#>         <chr>                   <chr>
#> 1 Afghanistan      19_99.745/19987071
#> 2 Afghanistan     20_00.2666/20595360
#> 3      Brazil   19_99.37737/172006362
#> 4      Brazil   20_00.80488/174504898
#> 5       China 19_99.212258/1272915272
#> 6       China 20_00.213766/1280428583

# What happens when we do not specify "sep"? 
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"))  # sep is not provided!
#> # A tibble: 6 x 3
#>       country var_1 var_2
#> *       <chr> <chr> <chr>
#> 1 Afghanistan    19    99
#> 2 Afghanistan    20    00
#> 3      Brazil    19    99
#> 4      Brazil    20    00
#> 5       China    19    99
#> 6       China    20    00

# => when_what is split at every non-alphanumeric character (_, ., and /), 
#    but only the first 2 pieces are kept: Warning (and loss of data)!

# Specifying different splitting characters:
# (a) split at "_": 
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "_")  # 
#> # A tibble: 6 x 3
#>       country var_1                var_2
#> *       <chr> <chr>                <chr>
#> 1 Afghanistan    19      99.745/19987071
#> 2 Afghanistan    20     00.2666/20595360
#> 3      Brazil    19   99.37737/172006362
#> 4      Brazil    20   00.80488/174504898
#> 5       China    19 99.212258/1272915272
#> 6       China    20 00.213766/1280428583

# (b) split at "." (specified as a regular expression "\\."):
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "\\.")  
#> # A tibble: 6 x 3
#>       country var_1             var_2
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan 19_99      745/19987071
#> 2 Afghanistan 20_00     2666/20595360
#> 3      Brazil 19_99   37737/172006362
#> 4      Brazil 20_00   80488/174504898
#> 5       China 19_99 212258/1272915272
#> 6       China 20_00 213766/1280428583

# (c) split at "/":
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "/")
#> # A tibble: 6 x 3
#>       country        var_1      var_2
#> *       <chr>        <chr>      <chr>
#> 1 Afghanistan    19_99.745   19987071
#> 2 Afghanistan   20_00.2666   20595360
#> 3      Brazil  19_99.37737  172006362
#> 4      Brazil  20_00.80488  174504898
#> 5       China 19_99.212258 1272915272
#> 6       China 20_00.213766 1280428583

Note that using the point or period (.) as a splitting character by specifying sep = "." would not work, as sep is interpreted as a regular expression, in which . matches any character. Instead, we need to use the escaped regular expression sep = "\\.". (See Chapter 14: Strings for details.)
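
The special role of the period in regular expressions can be illustrated with the base R function strsplit (which is used here only for demonstration purposes and is not part of tidyr):

```r
x <- "19_99.745/19987071"   # 1st value of when_what in table6

strsplit(x, split = "\\.")               # escaped period: split at "."
strsplit(x, split = ".", fixed = TRUE)   # fixed = TRUE: "." is taken literally
strsplit(x, split = ".")                 # regular expression ".": matches every character
```

The first 2 commands split x into 2 parts, whereas the last command treats every character of x as a separator and returns only empty strings.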

Practice: Split the when_what variable of table6 3 times to create a tibble table6a that contains 5 variables (columns) and reasonable variable names:

#> # A tibble: 6 x 5
#>   country     century year  cases  population
#> * <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583
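
Note that all new variables created by separate in the examples above are of type character, even when they contain only numbers. Returning to table3, the following sketch uses the optional argument convert = TRUE of separate to obtain numeric variables. The name t3_conv is chosen here for illustration:

```r
library(tidyverse)

# Separate rate and convert the new variables into numbers:
t3_conv <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

t3_conv  # cases and population are no longer of type <chr>
```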

2. unite variables

unite combines 2 variables (columns) into 1 variable (column) – adding an optional separator character – and is the complement to separate. Using unite requires the following arguments:

  • some tibble/data frame data;
  • the name of the new compound variable (column) col (specified as a character);
  • the names of the variables (columns) to be combined (specified by their names or column numbers);
  • an optional separator character sep (as a character; by default, sep = "_").

An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.

# Data to use: 
tidyr::table5  # Note that columns 2 and 3 contain 2 values (as characters!) that belong together. 
#> # A tibble: 6 x 4
#>       country century  year              rate
#> *       <chr>   <chr> <chr>             <chr>
#> 1 Afghanistan      19    99      745/19987071
#> 2 Afghanistan      20    00     2666/20595360
#> 3      Brazil      19    99   37737/172006362
#> 4      Brazil      20    00   80488/174504898
#> 5       China      19    99 212258/1272915272
#> 6       China      20    00 213766/1280428583

## Basics: ----- 

# Full unite command:
unite(data = table5, col = "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#>       country    yr              rate
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583
# Note that century and year variables disappear from output tibble.

# Shorter versions of the same command:
unite(table5, "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#>       country    yr              rate
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

# Using the pipe: 
table5 %>%
  unite("yr", century, year, sep = "")
#> # A tibble: 6 x 3
#>       country    yr              rate
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

## Variants: ----- 

# Providing a different separation character:
table5 %>%
  unite("yr", century, year, sep = "<--|-->")
#> # A tibble: 6 x 3
#>       country          yr              rate
#> *       <chr>       <chr>             <chr>
#> 1 Afghanistan 19<--|-->99      745/19987071
#> 2 Afghanistan 20<--|-->00     2666/20595360
#> 3      Brazil 19<--|-->99   37737/172006362
#> 4      Brazil 20<--|-->00   80488/174504898
#> 5       China 19<--|-->99 212258/1272915272
#> 6       China 20<--|-->00 213766/1280428583

# Specifying the variables to be combined (century and year) by their column numbers (2 & 3):
table5 %>% 
  unite("yr", 2, 3, sep = "")
#> # A tibble: 6 x 3
#>       country    yr              rate
#> *       <chr> <chr>             <chr>
#> 1 Afghanistan  1999      745/19987071
#> 2 Afghanistan  2000     2666/20595360
#> 3      Brazil  1999   37737/172006362
#> 4      Brazil  2000   80488/174504898
#> 5       China  1999 212258/1272915272
#> 6       China  2000 213766/1280428583

# Not dropping the original variables:
table5 %>%
  unite("yr", century, year, sep = "", remove = FALSE)
#> # A tibble: 6 x 5
#>       country    yr century  year              rate
#> *       <chr> <chr>   <chr> <chr>             <chr>
#> 1 Afghanistan  1999      19    99      745/19987071
#> 2 Afghanistan  2000      20    00     2666/20595360
#> 3      Brazil  1999      19    99   37737/172006362
#> 4      Brazil  2000      20    00   80488/174504898
#> 5       China  1999      19    99 212258/1272915272
#> 6       China  2000      20    00 213766/1280428583
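
That unite and separate are complements of each other can be checked by a round trip. In the following sketch, the names t5_united and t5_back are chosen for illustration, and sep = 2 makes use of the fact that separate interprets a numeric sep as the position at which to split:

```r
library(tidyverse)

# unite: 2 variables -> 1 variable
t5_united <- table5 %>%
  unite("yr", century, year, sep = "")

# separate: 1 variable -> 2 variables (split after the 2nd character)
t5_back <- t5_united %>%
  separate(yr, into = c("century", "year"), sep = 2)

# Compare the result with the original data:
all.equal(as.data.frame(t5_back), as.data.frame(table5))
```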

Practice: Take the data from dplyr::storms and unite the variables year, month, day into 1 variable date.

#> # A tibble: 6 x 11
#>   name  date   hour   lat  long status category  wind pressure ts_diameter
#>   <chr> <chr> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>       <dbl>
#> 1 Amy   1975…  0     27.5 -79.0 tropi… -1          25     1013          NA
#> 2 Amy   1975…  6.00  28.5 -79.0 tropi… -1          25     1013          NA
#> 3 Amy   1975… 12.0   29.5 -79.0 tropi… -1          25     1013          NA
#> 4 Amy   1975… 18.0   30.5 -79.0 tropi… -1          25     1013          NA
#> 5 Amy   1975…  0     31.5 -78.8 tropi… -1          25     1012          NA
#> 6 Amy   1975…  6.00  32.4 -78.7 tropi… -1          25     1012          NA
#> # ... with 1 more variable: hu_diameter <dbl>

Practice: Read the data from read_csv("http://rpository.com/ds4psy/data/table7.csv") into a tibble table7 and inspect its dimension and contents.

  1. Use multiple (4) separate commands to split table7 into a tibble table7a with multiple (5) columns.

  2. Use multiple (4) unite commands on table7a to re-create a tibble table7b that contains all data in 1 column.

Examples of table7 and possible solutions for table7a and table7b:

#> # A tibble: 6 x 1
#>   where_when_what                   
#>   <chr>                             
#> 1 "Afghanistan@19:99$745\\19987071" 
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"   
#> 4 "Brazil@20:00$80488\\174504898"   
#> 5 "China@19:99$212258\\1272915272"  
#> 6 "China@20:00$213766\\1280428583"

#> # A tibble: 6 x 5
#>   country     century year  rate   population
#> * <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583

#> # A tibble: 6 x 1
#>   where_when_what               
#> * <chr>                         
#> 1 Afghanistan:1999_745/19987071 
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362   
#> 4 Brazil:2000_80488/174504898   
#> 5 China:1999_212258/1272915272  
#> 6 China:2000_213766/1280428583

3. gather makes wide data longer

Gathering is the opposite of spreading and used when observations that are distributed over multiple columns should be contained in 1 variable (column). More specifically, gather moves the values of several variables (columns) into 1 column value and describes this value by the value of a new key variable. When gathering more than 2 variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).2

Using gather requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or name of the independent variable);
  • value is the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
  • ... or var_x:var_y is a list of variables (columns) to be gathered.
# ?gather # provides documentation

## Data to use: 
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
# Note that the counts of cases are distributed over 2 variables (columns) for each country.

## Basics: -----

# gather 2 variables into 1 variable:
gather(data = table4a, 
       key = year, value = cases, 
       `1999`:`2000`)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

# The same command using the pipe:
table4a %>%
  gather(key = year, value = cases, 
         `1999`:`2000`)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

## Variants: ----- 

# The same command with a different order of arguments:
table4a %>%
  gather(`1999`:`2000`, key = year, value = cases)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

# The same command specifying the numbers of the columns to gather:
table4a %>%
  gather(2:3, key = year, value = cases)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

Note that year is of type character in the examples above. To convert the key variable into a number (here: an integer), we can add the optional argument convert = TRUE:

## Default: convert = FALSE: 
table4a %>%
  gather(key = year, value = cases, `1999`:`2000`, convert = FALSE)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766
# => year is a character vector.

## Converting year into an integer:
table4a %>%
  gather(key = year, value = cases, `1999`:`2000`, convert = TRUE)
#> # A tibble: 6 x 3
#>   country      year  cases
#>   <chr>       <int>  <int>
#> 1 Afghanistan  1999    745
#> 2 Brazil       1999  37737
#> 3 China        1999 212258
#> 4 Afghanistan  2000   2666
#> 5 Brazil       2000  80488
#> 6 China        2000 213766
# => year is a vector of integers. 
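
The tidyr documentation describes convert = TRUE as running type.convert() on the key column. The following sketch applies that conversion to a stand-alone character vector (year_chr is a made-up example) and then shows an explicit alternative that converts the key column after gathering:

```r
library(tidyverse)

# Hypothetical key values, stored as characters:
year_chr <- c("1999", "2000")

# type.convert() picks a suitable type (here: integer):
year_int <- type.convert(year_chr, as.is = TRUE)
class(year_int)

# Explicit alternative: gather first, then convert the key column by hand
table4a_long <- table4a %>%
  gather(key = year, value = cases, `1999`:`2000`) %>%
  mutate(year = as.integer(year))

table4a_long
```

The explicit version is more verbose, but states the intended type directly rather than relying on automatic detection.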

Practice: Save the following data as a tibble de and then turn it into tidy data (by using gather to create a single variable share and listing the election year as an additional variable).

party share_2013 share_2017
CDU/CSU 0.415 0.330
SPD 0.257 0.205
Others 0.328 0.465
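
One possible way of entering this table as a tibble is sketched here. It assumes that party should be a factor whose levels follow the row order of the table (rather than alphabetical order); the helper vector parties is introduced only for illustration:

```r
library(tidyverse)

# Assumption: factor levels in the order in which the parties are listed
parties <- c("CDU/CSU", "SPD", "Others")

de <- tibble(
  party      = factor(parties, levels = parties),
  share_2013 = c(0.415, 0.257, 0.328),
  share_2017 = c(0.330, 0.205, 0.465)
)

de
```

Fixing the order of the factor levels matters later on: functions that sort rows by party use the level order, not the order in which the rows were entered.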
## (a) Data saved as a tibble (see above): 
de
#> # A tibble: 3 x 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

## (b) Converting de into a tidy data table:
de_2 <- de %>%
  gather(share_2013:share_2017, key = "election", value = "share") %>%
  separate(col = election, into = c("dummy", "year")) %>%
  select(year, party, share)

de_2
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

4. spread makes long data wider

Spreading is the opposite of gathering and is used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider; see footnote 3).

Using spread requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable whose values become the names of the new columns (or the name of the independent variable whose levels are spread out);
  • value is the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable).

Note that we do not need to specify a range of new columns. The number of new columns is determined by the number of different values in the key variable.
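
This relation between the key variable and the new columns can be checked directly. A minimal sketch using table2 (which is loaded with the tidyverse); the object name wide is chosen only for illustration:

```r
library(tidyverse)

# Number of different values in the key variable (type):
n_keys <- n_distinct(table2$type)
n_keys

# Spreading replaces the key and value columns by n_keys new columns:
wide <- table2 %>%
  spread(key = type, value = count)

dim(table2)  # 12 rows, 4 columns
dim(wide)    # 12 / 2 = 6 rows, 4 - 2 + 2 = 4 columns
```

With only 2 different key values the number of columns stays the same, but the number of rows is halved.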

# ?spread # provides documentation

## Data to use: 
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583
# Note that count contains 2 DVs which are described by the values of type. 

## Basics: -----

# spread 2 rows into 2 columns of 1 row:
spread(data = table2, 
       key = type, value = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#> * <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# The same command using the pipe:
table2 %>% 
  spread(key = type, value = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#> * <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# The same, but shorter (arguments matched by position):
table2 %>% 
  spread(type, count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#> * <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583


## Variants: -----

# Use sep to create new column names of the form <key_name><sep><key_value>:
table2 %>% 
  spread(key = type, value = count, sep = ":")
#> # A tibble: 6 x 4
#>   country      year `type:cases` `type:population`
#> * <chr>       <int>        <int>             <int>
#> 1 Afghanistan  1999          745          19987071
#> 2 Afghanistan  2000         2666          20595360
#> 3 Brazil       1999        37737         172006362
#> 4 Brazil       2000        80488         174504898
#> 5 China        1999       212258        1272915272
#> 6 China        2000       213766        1280428583

Practice: Take the 6 x 3 tibble de_2 (from above) and use spread to create a 3 x 3 tibble de_3 that re-creates the original tibble de from it.

## (a) Data from above: 
de_2
#> # A tibble: 6 x 3
#>    year   party share
#> * <chr>  <fctr> <dbl>
#> 1  2013 CDU/CSU 0.415
#> 2  2013     SPD 0.257
#> 3  2013  Others 0.328
#> 4  2017 CDU/CSU 0.330
#> 5  2017     SPD 0.205
#> 6  2017  Others 0.465

## (b) Using spread to put share by year into 2 columns/variables:
de_3 <- de_2 %>% 
  spread(key = year, value = share) %>%
  rename(share_2013 = `2013`,  # restore original variable names
         share_2017 = `2017`)

de_3
#> # A tibble: 3 x 3
#>     party share_2013 share_2017
#> *  <fctr>      <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2     SPD      0.257      0.205
#> 3  Others      0.328      0.465

## (c) Comparing de_3 to de: 
de
#> # A tibble: 3 x 3
#>     party share_2013 share_2017
#>    <fctr>      <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2     SPD      0.257      0.205
#> 3  Others      0.328      0.465
all.equal(de_3, de)
#> [1] TRUE

Practice: Moving stocks from wide to long to wide.

The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):

Stock data example showing the start and end prices of the shares of 3 companies on 3 days.
stock d1_start d1_end d2_start d2_end d3_start d3_end
Amada 2.5 3.6 3.5 4.2 4.4 2.8
Betix 3.3 2.9 3.0 2.1 2.3 2.5
Cevis 4.2 4.8 4.6 3.1 3.2 3.7

a. Create a tibble st that contains this data in this (wide) format.

b. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns.

c. Create a (line) graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis).

d. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day.

# library(tidyverse)

## (a) Enter stock data (in wide format) as a tibble:
st <- tribble(
  ~stock, ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,  
  #-----|----------|--------|----------|--------|----------|--------|
  "Amada",   2.5,     3.6,    3.5,       4.2,      4.4,       2.8,            
  "Betix",   3.3,     2.9,    3.0,       2.1,      2.3,       2.5,  
  "Cevis",   4.2,     4.8,    4.6,       3.1,      3.2,       3.7     
)
dim(st)
#> [1] 3 7

## Note data structure: 
## 2 nested factors: day (1 to 3), type (start or end).

## (b) Change from wide to long format 
##     that contains the day (d1, d2, d3) and type (start vs. end) as separate columns:
st_long <- st %>%
  gather(d1_start:d3_end, key = "key", value = "val") %>%
  separate(key, into = c("day", "time")) %>%
  arrange(stock, day, time) # optional: arrange rows
st_long
#> # A tibble: 18 x 4
#>    stock day   time    val
#>    <chr> <chr> <chr> <dbl>
#>  1 Amada d1    end    3.60
#>  2 Amada d1    start  2.50
#>  3 Amada d2    end    4.20
#>  4 Amada d2    start  3.50
#>  5 Amada d3    end    2.80
#>  6 Amada d3    start  4.40
#>  7 Betix d1    end    2.90
#>  8 Betix d1    start  3.30
#>  9 Betix d2    end    2.10
#> 10 Betix d2    start  3.00
#> 11 Betix d3    end    2.50
#> 12 Betix d3    start  2.30
#> 13 Cevis d1    end    4.80
#> 14 Cevis d1    start  4.20
#> 15 Cevis d2    end    3.10
#> 16 Cevis d2    start  4.60
#> 17 Cevis d3    end    3.70
#> 18 Cevis d3    start  3.20

## (c) Plot the end values (on the y-axis) of the 3 stocks over 3 days (x-axis):
st_long %>% 
  filter(time == "end") %>%
  ggplot(aes(x = day, y = val, color = stock, shape = stock)) +
  geom_point(size = 4) + 
  geom_line(aes(group = stock)) +
  ## Polishing the plot (labels and theme): 
  labs(title = "End prices of stocks", 
       x = "Day", y = "End price", 
       shape = "Stock:", color = "Stock:") +
  theme_bw()

## (d) Change st_long into a wider format that lists start and end as 2 distinct variables (columns):
st_long %>%
  spread(key = time, value = val) %>%
  mutate(day_nr = parse_integer(str_sub(day, 2, 2))) # optional: get day_nr as integer variable
#> # A tibble: 9 x 5
#>   stock day     end start day_nr
#>   <chr> <chr> <dbl> <dbl>  <int>
#> 1 Amada d1     3.60  2.50      1
#> 2 Amada d2     4.20  3.50      2
#> 3 Amada d3     2.80  4.40      3
#> 4 Betix d1     2.90  3.30      1
#> 5 Betix d2     2.10  3.00      2
#> 6 Betix d3     2.50  2.30      3
#> 7 Cevis d1     4.80  4.20      1
#> 8 Cevis d2     3.10  4.60      2
#> 9 Cevis d3     3.70  3.20      3
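
In tidyr versions from 1.0.0 onward, gather and spread are superseded by pivot_longer and pivot_wider. The following sketch repeats steps (b) and (d) of the stock practice with these functions, assuming such a version is installed. The object names st_long_2 and st_wide_2 are chosen only for illustration:

```r
library(tidyverse)

# Stock data from the practice above (wide format):
st <- tribble(
  ~stock,  ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,
  "Amada", 2.5,       3.6,     3.5,       4.2,     4.4,       2.8,
  "Betix", 3.3,       2.9,     3.0,       2.1,     2.3,       2.5,
  "Cevis", 4.2,       4.8,     4.6,       3.1,     3.2,       3.7
)

# Assumption: tidyr >= 1.0.0 (provides pivot_longer and pivot_wider).
# Wide to long, splitting the column names at "_" into day and time:
st_long_2 <- st %>%
  pivot_longer(cols = -stock,
               names_to = c("day", "time"),
               names_sep = "_",
               values_to = "val")

dim(st_long_2)  # 18 rows, 4 columns

# Long to wider, with start and end as 2 distinct columns:
st_wide_2 <- st_long_2 %>%
  pivot_wider(names_from = time, values_from = val)

dim(st_wide_2)  # 9 rows, 4 columns
```

Here, names_sep does the work of the separate step in (b), so that gathering and splitting the key happen in a single call.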

More on tidy data

Conclusion

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data
4. Exploring data
5. Tidy data

[Last update on 2018-07-11 09:15:33 by hn.]


  1. This is different in Sankey diagrams, as shown at https://developers.google.com/chart/interactive/docs/gallery/sankey.

  2. The length and width of a data set are relative terms here: gathering tends to decrease data width by increasing length, spreading tends to decrease data length by increasing width.

  3. Again, the length and width of data sets are relative terms.